Demystifying CLIP Data
Authors: Hu Xu, Saining Xie, Xiaoqing Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, Christoph Feichtenhofer
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experimental study rigorously isolates the model and training settings, concentrating solely on data. MetaCLIP applied to Common Crawl with 400M image-text data pairs outperforms CLIP's data on multiple standard benchmarks. In zero-shot ImageNet classification, MetaCLIP achieves 70.8% accuracy, surpassing CLIP's 68.3% on ViT-B models. We present an empirical study on data curation, with frozen model architecture and training schedule. We focus solely on the impact of training data, excluding other factors that could confound the results. |
| Researcher Affiliation | Collaboration | Hu Xu¹, Saining Xie², Xiaoqing Ellen Tan¹, Po-Yao Huang¹, Russell Howes¹, Vasu Sharma¹, Shang-Wen Li¹, Gargi Ghosh¹, Luke Zettlemoyer¹,³, Christoph Feichtenhofer¹. ¹FAIR, Meta AI; ²New York University; ³University of Washington |
| Pseudocode | Yes | We provide the Python pseudo-code in Algorithm 1. Algorithm 1: Pseudo-code of Curation Algorithm in Python style (see Sec. A.10 for samples). A hedged sketch of this curation logic is given after the table. |
| Open Source Code | Yes | Curation code and training data distribution over metadata is available at https://github.com/facebookresearch/MetaCLIP. |
| Open Datasets | Yes | MetaCLIP applied to Common Crawl (CC) with 400M data points outperforms CLIP on multiple standard benchmarks. We adopt Common Crawl (CC) as the source to build such a pool (the paper's footnote 4 links to https://commoncrawl.org). |
| Dataset Splits | No | The paper mentions that 'training/validation data has been de-duplicated' for ImageNet zero-shot classification but does not provide specific ratios, counts, or methodologies for how the primary Common Crawl dataset was split into training and validation sets for their own model pre-training. |
| Hardware Specification | Yes | We strictly follow the CLIP training setup, using V100 32GB GPUs and an equivalent global batch size of 32,768. For ViT-B/32 and ViT-B/16, we use 64 GPUs with a per-GPU batch size of 512, and for ViT-L/14 we use 128 GPUs with a 256 per-GPU batch size... We use 256 A100 80GB GPUs to train the ViT-H/14 and ViT-bigG/14 models for 1 week and 2 months, respectively. |
| Software Dependencies | No | The paper mentions 'Python pseudo-code' but does not specify any software dependencies with version numbers (e.g., PyTorch, TensorFlow, scikit-learn, CUDA versions). |
| Experiment Setup | Yes | Training Setup: We strictly follow the CLIP training setup, using V100 32GB GPUs and an equivalent global batch size of 32,768. For ViT-B/32 and ViT-B/16, we use 64 GPUs with a per-GPU batch size of 512, and for ViT-L/14 we use 128 GPUs with a 256 per-GPU batch size... We train in all experiments for the same number of iterations that correspond to 12.8B seen image-text pairs during training (32 epochs for 400M). Table 12: Hyperparameters of OpenAI CLIP / MetaCLIP ... Seen Pairs 12.8B (400M x 32 epochs) ... Batch Size 32768 ... Learning Rate 5.0e-4 (B/32, B/16), 4.0e-4 (L/14) ... Warm-up 2k. A hedged configuration sketch restating these numbers is given after the table. |
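
The curation referenced in the Pseudocode row (Algorithm 1 of the paper) amounts to sub-string matching of alt-texts against a metadata entry list, followed by per-entry balancing with a threshold t. The snippet below is a minimal, illustrative sketch of that idea under our own assumptions: the function names (`substr_match`, `curate`), the brute-force matching loop, and the two-pass structure are not taken from the released implementation (see the GitHub repository linked above for the official code).

```python
import random
from collections import Counter

def substr_match(text, metadata):
    """Return the metadata entries that occur as substrings of an alt-text.
    Brute-force scan for illustration only; the paper matches ~500k entries."""
    return [entry for entry in metadata if entry in text]

def curate(pairs, metadata, t=20_000, seed=0):
    """Balance a pool of (image, alt-text) pairs against a metadata list.
    Tail entries (fewer than t matches) are kept in full; head entries are
    down-sampled so each contributes roughly t pairs."""
    rng = random.Random(seed)

    # Pass 1: count how many pairs match each metadata entry.
    matches = [substr_match(text, metadata) for _, text in pairs]
    counts = Counter(entry for m in matches for entry in m)

    # Keep probability per entry: 1.0 for tail entries, t / count for head entries.
    prob = {entry: min(t, c) / c for entry, c in counts.items()}

    # Pass 2: keep a pair if any of its matched entries draws it.
    curated = []
    for pair, matched in zip(pairs, matches):
        if any(rng.random() < prob[entry] for entry in matched):
            curated.append(pair)
    return curated
```

With the settings reported in the paper (roughly 500k metadata entries and t = 20,000), this balancing flattens the head of the match distribution while keeping all tail matches; pairs that match no entry are dropped.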
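
For the training side, the hyperparameters quoted in the Hardware Specification and Experiment Setup rows can be collected into a small configuration for quick reference. The dictionary below only restates those quoted numbers; the field names are ours, not identifiers from the released code, and the step count at the end is the arithmetic implied by 12.8B seen pairs at a global batch size of 32,768.

```python
# Restated training setup (values quoted from the paper; field names are ours).
METACLIP_400M_TRAINING = {
    "seen_pairs": 12_800_000_000,      # 400M pairs x 32 epochs
    "global_batch_size": 32_768,
    "warmup_steps": 2_000,
    "learning_rate": {
        "ViT-B/32": 5.0e-4,
        "ViT-B/16": 5.0e-4,
        "ViT-L/14": 4.0e-4,
    },
    # (device, number of GPUs, per-GPU batch size); each product equals 32,768.
    "hardware": {
        "ViT-B/32": ("V100 32GB", 64, 512),
        "ViT-B/16": ("V100 32GB", 64, 512),
        "ViT-L/14": ("V100 32GB", 128, 256),
    },
}

# Optimizer steps implied by the schedule: 12.8B / 32,768 = 390,625.
steps = METACLIP_400M_TRAINING["seen_pairs"] // METACLIP_400M_TRAINING["global_batch_size"]
print(steps)  # 390625
```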