An Inverse Scaling Law for CLIP Training
Authors: Xianhang Li, Zeyu Wang, Cihang Xie
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in 2 days, 67.8% in 3 days, and 69.3% in 4 days. Our method also works well when scaling up: with G/14, we register a new record of 83.0% ImageNet-1k zero-shot accuracy, and meanwhile accelerate the training by 33× compared to its OpenCLIP counterpart. |
| Researcher Affiliation | Academia | Xianhang Li*, Zeyu Wang*, Cihang Xie (*equal contribution), UC Santa Cruz |
| Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'. |
| Open Source Code | Yes | Our code is available at https://github.com/UCSC-VLAA/CLIPA. |
| Open Datasets | Yes | We train our models on the LAION-400M [52] dataset for 6.4 epochs, equivalent to 2,000 ImageNet-1k epochs; this is then followed by a 0.36-epoch fine-tuning stage on full-resolution images (224×224) with a maximum text length of 32. (A rough check of this epoch conversion appears after the table.) |
| Dataset Splits | No | The paper describes pre-training and fine-tuning stages on LAION-400M and LAION-2B, and evaluates on ImageNet-1k zero-shot accuracy, but does not provide specific train/validation/test dataset splits with percentages or sample counts for the training datasets themselves. |
| Hardware Specification | Yes | For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in 2 days. |
| Software Dependencies | No | We implement two codebases based on JAX [5] and PyTorch [39] respectively. Our JAX codebase is built on Big Vision [4] and our PyTorch codebase mainly follows OpenCLIP [24]. The paper mentions general software tools (JAX, PyTorch, Big Vision, OpenCLIP) but does not provide specific version numbers for these dependencies. |
| Experiment Setup | Yes | Our training setup largely follows FLIP [29]. We use the vanilla ViT [18] as our visual encoder and the non-autoregressive Transformer [56] architecture as our text encoder. We train our models on the LAION-400M [52] dataset for 6.4 epochs, equivalent to 2,000 ImageNet-1k epochs; this is then followed by a 0.36-epoch fine-tuning stage on full-resolution images (224×224) with a maximum text length of 32. To ensure effective contrast between training samples, we set the batch size to 32k. We apply a base learning rate of 8e-6 in the main training stage and 4e-7 in the fine-tuning stage. Gradient checkpointing [8] is used to conserve GPU/TPU memory. Our data augmentation includes a simple random resizing crop with a minimum cropping ratio of 40%. Detailed hyperparameter settings and model configurations can be found in the appendix. (A hedged config sketch based on these values follows the table.) |
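
The epoch equivalence quoted in the "Open Datasets" row follows from simple arithmetic over the dataset sizes. The snippet below is a rough sanity check under assumed sizes (~400M image-text pairs in LAION-400M, ~1.28M training images in ImageNet-1k); the paper states the equivalence directly rather than deriving it.

```python
# Rough sanity check of "6.4 LAION-400M epochs ~ 2,000 ImageNet-1k epochs".
# Dataset sizes below are approximate assumptions, not values from the paper.
laion_400m_samples = 400e6    # ~400M image-text pairs
imagenet_1k_samples = 1.28e6  # ~1.28M training images

samples_seen = 6.4 * laion_400m_samples
equivalent_epochs = samples_seen / imagenet_1k_samples
print(f"{equivalent_epochs:.0f} ImageNet-1k-equivalent epochs")  # -> roughly 2000
```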
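
The "Experiment Setup" row collects the main hyperparameters reported in the paper. The dictionary below is a minimal sketch that gathers those values in one place; the field names and structure are illustrative assumptions, not the authors' actual configuration schema, which lives in the CLIPA repository and the paper's appendix.

```python
# Hedged sketch of the reported CLIPA training recipe. Values are taken from the
# quoted setup; key names and the exact batch size (32k -> 32,768) are assumptions.
clipa_training_sketch = {
    "pretrain": {
        "dataset": "LAION-400M",
        "epochs": 6.4,                      # ~2,000 ImageNet-1k-equivalent epochs
        "batch_size": 32_768,               # "32k" in the paper
        "base_lr": 8e-6,
        "image_encoder": "ViT",             # vanilla ViT [18]
        "text_encoder": "non-autoregressive Transformer",
        "augmentation": {"random_resized_crop_min_ratio": 0.4},
        "gradient_checkpointing": True,     # to conserve GPU/TPU memory
    },
    "finetune": {
        "epochs": 0.36,
        "base_lr": 4e-7,
        "image_resolution": 224,            # full-resolution 224x224 images
        "max_text_length": 32,
    },
}
```

The separate fine-tuning learning rate and the short 0.36-epoch duration reflect the two-stage schedule described in the quote: a long main training stage followed by a brief fine-tuning pass at full 224×224 resolution with 32-token text.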