An Inverse Scaling Law for CLIP Training

Authors: Xianhang Li, Zeyu Wang, Cihang Xie

NeurIPS 2023 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in 2 days, 67.8% in 3 days, and 69.3% in 4 days. Our method also works well when scaling up: with G/14, we register a new record of 83.0% ImageNet-1k zero-shot accuracy, and meanwhile accelerate the training by 33× compared to its OpenCLIP counterpart.
Researcher Affiliation | Academia | Xianhang Li*, Zeyu Wang*, Cihang Xie (*equal contribution), UC Santa Cruz
Pseudocode | No | The paper does not contain any sections or figures explicitly labeled 'Pseudocode' or 'Algorithm'.
Open Source Code | Yes | Our code is available at https://github.com/UCSC-VLAA/CLIPA.
Open Datasets | Yes | We train our models on the LAION-400M [52] dataset for 6.4 epochs, equivalent to 2,000 ImageNet-1k epochs; this is then followed by a 0.36-epoch fine-tuning stage on full-resolution images (224×224) with a maximum text length of 32.
Dataset Splits | No | The paper describes pre-training and fine-tuning stages on LAION-400M and LAION-2B, and evaluates zero-shot accuracy on ImageNet-1k, but does not provide specific train/validation/test dataset splits with percentages or sample counts for the training datasets themselves.
Hardware Specification | Yes | For example, using 8 A100 GPUs, our CLIP models achieve zero-shot top-1 ImageNet-1k accuracies of 63.2% in 2 days.
Software Dependencies | No | We implement two codebases based on JAX [5] and PyTorch [39] respectively. Our JAX codebase is built on Big Vision [4] and our PyTorch codebase mainly follows OpenCLIP [24]. The paper mentions general software tools (JAX, PyTorch, Big Vision, OpenCLIP) but does not provide specific version numbers for these dependencies.
Experiment Setup | Yes | Our training setup largely follows FLIP [29]. We use the vanilla ViT [18] as our visual encoder and the non-autoregressive Transformer [56] architecture as our text encoder. We train our models on the LAION-400M [52] dataset for 6.4 epochs, equivalent to 2,000 ImageNet-1k epochs; this is then followed by a 0.36-epoch fine-tuning stage on full-resolution images (224×224) with a maximum text length of 32. To ensure effective contrast between training samples, we set the batch size to 32k. We apply a base learning rate of 8e-6 in the main training stage and 4e-7 in the fine-tuning stage. Gradient checkpointing [8] is used to conserve GPU/TPU memory. Our data augmentation includes a simple random resizing crop with a minimum cropping ratio of 40%. Detailed hyperparameter settings and model configurations can be found in the appendix.
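To make the Experiment Setup row concrete, the sketch below collects the reported hyperparameters into a hypothetical PyTorch-style configuration. The dictionary layout, variable names, and the torchvision transform are illustrative assumptions, not the authors' actual CLIPA configuration; only the numeric values come from the quoted text, and the official code at the linked repository is the authoritative reference.

# Hypothetical sketch of the reported CLIPA training setup (not the authors' code).
# Numeric values are taken from the "Experiment Setup" row above; everything else is assumed.
import torchvision.transforms as T

main_stage = {
    "dataset": "LAION-400M",
    "epochs": 6.4,                  # equivalent to ~2,000 ImageNet-1k epochs of samples seen
    "batch_size": 32_768,           # 32k, for effective contrast between training samples
    "base_lr": 8e-6,
    "gradient_checkpointing": True, # used to conserve GPU/TPU memory
}

finetune_stage = {
    "epochs": 0.36,
    "image_size": 224,              # full-resolution fine-tuning
    "max_text_length": 32,          # maximum text length in tokens
    "base_lr": 4e-7,
    "gradient_checkpointing": True,
}

# Data augmentation described in the quote: a simple random resizing crop
# with a minimum cropping ratio of 40% (expressed here as the lower bound of
# RandomResizedCrop's area scale, an assumed mapping).
train_transform = T.Compose([
    T.RandomResizedCrop(finetune_stage["image_size"], scale=(0.4, 1.0)),
    T.ToTensor(),
])

As a design note, the two-stage layout above mirrors the quoted recipe: a long reduced-cost main stage followed by a short fine-tuning stage at full resolution with a much smaller learning rate.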