Demystifying CLIP Data

Authors: Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to Common Crawl with 400M image-text pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. We present an empirical study on data curation, with frozen model architecture and training schedule. We focus solely on the impact of training data, excluding other factors that could confound the results. (A hedged zero-shot evaluation sketch appears after the table.)
Researcher Affiliation | Collaboration | Hu Xu¹, Saining Xie², Xiaoqing Ellen Tan¹, Po-Yao Huang¹, Russell Howes¹, Vasu Sharma¹, Shang-Wen Li¹, Gargi Ghosh¹, Luke Zettlemoyer¹,³, Christoph Feichtenhofer¹ (¹FAIR, Meta AI; ²New York University; ³University of Washington)
Pseudocode | Yes | We provide the Python pseudo-code in Algorithm 1. Algorithm 1: Pseudo-code of Curation Algorithm in Python style (see Sec. A.10 for samples). (A hedged sketch of this curation procedure appears after the table.)
Open Source Code | Yes | Curation code and the training data distribution over metadata are available at https://github.com/facebookresearch/MetaCLIP.
Open Datasets | Yes | MetaCLIP applied to Common Crawl (CC) with 400M data points outperforms CLIP on multiple standard benchmarks. We adopt Common Crawl (CC) as the source to build such a pool (the corresponding footnote links to https://commoncrawl.org).
Dataset Splits | No | The paper mentions that 'training/validation data has been de-duplicated' for ImageNet zero-shot classification but does not provide specific ratios, counts, or methodology for how the primary Common Crawl dataset was split into training and validation sets for their own model pre-training.
Hardware Specification | Yes | We strictly follow the CLIP training setup, using V100 32GB GPUs and an equivalent global batch size of 32,768. For ViT-B/32 and ViT-B/16, we use 64 GPUs with a per-GPU batch size of 512, and for ViT-L/14 we use 128 GPUs with a 256 per-GPU batch size... We use 256 A100 80GB GPUs to train the ViT-H/14 and ViT-bigG/14 models for 1 week and 2 months, respectively.
Software Dependencies | No | The paper mentions 'Python pseudo-code' but does not specify any software dependencies with version numbers (e.g., PyTorch, TensorFlow, scikit-learn, CUDA versions).
Experiment Setup | Yes | Training Setup: We strictly follow the CLIP training setup, using V100 32GB GPUs and an equivalent global batch size of 32,768. For ViT-B/32 and ViT-B/16, we use 64 GPUs with a per-GPU batch size of 512, and for ViT-L/14 we use 128 GPUs with a 256 per-GPU batch size... We train in all experiments for the same number of iterations, corresponding to 12.8B seen image-text pairs during training (32 epochs for 400M). Table 12: Hyperparameters of OpenAI CLIP / MetaCLIP: Seen Pairs 12.8B (400M × 32 epochs); Batch Size 32,768; Learning Rate 5.0e-4 (B/32, B/16), 4.0e-4 (L/14); Warm-up 2k. (These hyperparameters are restated in the config sketch after the table.)
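
Zero-shot evaluation sketch (referenced from the Research Type row). A minimal illustration of the zero-shot classification protocol behind the quoted ImageNet numbers, using the open_clip library rather than the authors' evaluation code. The model name and pretrained tag below are assumptions; `open_clip.list_pretrained()` lists the identifiers actually shipped with your installed version.

```python
import torch
import open_clip
from PIL import Image

# Model name and pretrained tag are assumptions, not confirmed by the paper;
# check open_clip.list_pretrained() for the identifiers available to you.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32-quickgelu", pretrained="metaclip_400m")
tokenizer = open_clip.get_tokenizer("ViT-B-32-quickgelu")
model.eval()

# Toy label set; a real ImageNet evaluation would use all 1,000 class names
# and average text embeddings over multiple prompt templates.
class_names = ["tabby cat", "golden retriever", "airliner"]
text = tokenizer([f"a photo of a {c}" for c in class_names])
image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # hypothetical image path

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    image_features /= image_features.norm(dim=-1, keepdim=True)
    text_features /= text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

print(dict(zip(class_names, probs[0].tolist())))
```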
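
Curation sketch (referenced from the Pseudocode row). A minimal, hedged reading of the curation procedure the paper describes: substring-match each caption against the metadata entries, then balance the pool so that no entry contributes more than roughly t matched pairs by down-sampling head entries. Function and variable names (`curate`, `matched_entries`, `keep_prob`) are illustrative, not the authors' released code, and the handling of pairs that match several entries follows one plausible reading of Algorithm 1.

```python
import random
from collections import Counter

def curate(pairs, metadata, t=20_000, seed=0):
    """Balance (image, text) pairs over metadata entries.

    Two passes: (1) count how many pairs match each metadata entry via
    naive substring matching; (2) keep tail entries (count < t) as-is and
    down-sample head entries with probability t / count.
    """
    rng = random.Random(seed)

    def matched_entries(text):
        # Naive substring matching; at Common Crawl scale the real pipeline
        # would need something far faster (e.g., an Aho-Corasick automaton).
        return [e for e in metadata if e in text]

    # Pass 1: per-entry match counts over the whole pool.
    counts = Counter()
    for _, text in pairs:
        counts.update(matched_entries(text))

    # Keep probability per entry: 1.0 for tail entries, t / count for head entries.
    keep_prob = {e: min(1.0, t / c) for e, c in counts.items()}

    # Pass 2: keep a pair if the draw succeeds for its most tail-ish matched entry.
    curated = []
    for image, text in pairs:
        entries = matched_entries(text)
        if entries and rng.random() < max(keep_prob[e] for e in entries):
            curated.append((image, text))
    return curated
```

On a toy pool, `curate(pool, ["cat", "dog", "sunset"], t=2)` keeps every pair whose caption matches only rare entries while thinning pairs dominated by frequent ones; the threshold the paper reports for its 400M pool is t = 20k.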
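
Training configuration sketch (referenced from the Experiment Setup row). The quoted hyperparameters restated as a plain config dict; the key names and nesting are illustrative, not the authors' config schema, and only values quoted above are filled in.

```python
# Hyperparameters quoted in the Experiment Setup row; key names are illustrative.
METACLIP_400M_TRAINING = {
    "global_batch_size": 32_768,
    "seen_pairs": 12_800_000_000,   # 400M pairs x 32 epochs
    "warmup_steps": 2_000,
    "learning_rate": {"ViT-B/32": 5.0e-4, "ViT-B/16": 5.0e-4, "ViT-L/14": 4.0e-4},
    "gpus": {
        "ViT-B/32": {"count": 64, "per_gpu_batch": 512, "type": "V100 32GB"},
        "ViT-B/16": {"count": 64, "per_gpu_batch": 512, "type": "V100 32GB"},
        "ViT-L/14": {"count": 128, "per_gpu_batch": 256, "type": "V100 32GB"},
    },
}

# Sanity check: per-GPU batch x GPU count reproduces the global batch size,
# and the seen-pairs budget implies the total number of optimizer steps.
for arch, g in METACLIP_400M_TRAINING["gpus"].items():
    assert g["count"] * g["per_gpu_batch"] == METACLIP_400M_TRAINING["global_batch_size"]
steps = METACLIP_400M_TRAINING["seen_pairs"] // METACLIP_400M_TRAINING["global_batch_size"]
print(steps)  # 390,625 optimizer steps
```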