Data Determines Distributional Robustness in Contrastive Language Image Pre-training (CLIP)
Authors: Alex Fang, Gabriel Ilharco, Mitchell Wortsman, Yuhao Wan, Vaishaal Shankar, Achal Dave, Ludwig Schmidt
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We answer this question via a systematic experimental investigation. Concretely, we study five different possible causes for the robustness gains: (i) the training set size, (ii) the training distribution, (iii) language supervision at training time, (iv) language supervision at test time, and (v) the contrastive loss function. Our experiments show that the more diverse training distribution is the main cause for the robustness gains, with the other factors contributing little to no robustness. |
| Researcher Affiliation | Collaboration | University of Washington; Amazon; Allen Institute for Artificial Intelligence. |
| Pseudocode | No | The paper describes methods and procedures in narrative text but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | No | The paper provides a link for the ImageNet-Captions dataset but does not explicitly state that the source code for the methodology described in the paper is available or provide a link to it. |
| Open Datasets | Yes | First, we introduce ImageNet-Captions, a new dataset for training on paired language-image data. ImageNet-Captions augments 463,622 of the 1.2 million images in the ImageNet 2012 training set (Russakovsky et al., 2015) with the original text data sourced from the corresponding Flickr images. ImageNet-Captions enables controlled experiments comparing standard (class-based) ImageNet training with language-image training on the same set of images. |
| Dataset Splits | No | The paper uses well-known datasets and evaluates on test distributions (e.g., ImageNet, ImageNet-V2), but it does not explicitly specify training/validation/test splits with percentages or sample counts for the data used in its experiments, nor does it refer to specific predefined splits from cited works for its particular use case. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. |
| Software Dependencies | No | The paper mentions certain libraries (e.g., 'better-profanity library', 'profanity-check library') and optimizers, but it does not provide specific version numbers for these or other key software dependencies like programming languages or deep learning frameworks. |
| Experiment Setup | Yes | CLIP experiments are trained with cross-entropy losses using the AdamW optimizer with an initial learning rate of 0.001 and a cosine-annealing learning rate schedule with 500 warmup steps. Hyperparameters for AdamW are set at β1 = 0.9, β2 = 0.999, and ε = 1e-8. The batch size is set to 1024. CLIP models trained on ImageNet-Captions are trained for 32 epochs, while CLIP models trained on all of ImageNet are trained for 90 epochs. (See the setup sketch below the table.) |
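The reported setup maps directly onto a standard optimizer/scheduler configuration. The sketch below is not the authors' code; it is a minimal PyTorch illustration of the stated hyperparameters (AdamW with lr 1e-3, β1 = 0.9, β2 = 0.999, ε = 1e-8, and cosine annealing with 500 linear warmup steps). The function name, the `model` argument, and the warmup-then-cosine schedule shape are assumptions for illustration only.

```python
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, warmup_steps=500):
    """Hypothetical helper mirroring the hyperparameters reported in the paper."""
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=1e-3,             # initial learning rate from the paper
        betas=(0.9, 0.999),  # beta_1, beta_2 from the paper
        eps=1e-8,            # epsilon from the paper
    )

    def lr_lambda(step):
        # Linear warmup for the first `warmup_steps`, then cosine annealing
        # down toward zero over the remaining steps (assumed schedule shape).
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

With the reported batch size of 1024, `total_steps` would follow from the dataset size and epoch count (32 epochs on ImageNet-Captions, 90 epochs on full ImageNet).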