Exploring the Limits of Large Scale Pre-training
Authors: Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with the number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. |
| Researcher Affiliation | Industry | Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, and Hanie Sedghi; Google Research; {samiraabnar,dehghani,neyshabur,hsedghi}@google.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | All the experiments conducted in this paper are based on Scenic library (Dehghani et al., 2021b). We have shared details on the customization done in the controlled experiments in Appendix 1.1. No explicit statement about the authors' specific code for this paper being open-source or a link to it is provided. |
| Open Datasets | Yes | The models are pre-trained on either JFT (Sun et al., 2017) with 303M images and 18K classes or ImageNet21K (Deng et al., 2009) with 14M images and 21K classes and evaluated on a variety of downstream datasets for few-shot and transfer learning settings. Our 25 downstream tasks cover a wide range of standard datasets that are included in benchmarks like VTAB (Zhai et al., 2019), Meta-Dataset (Triantafillou et al., 2019), Wilds (Koh et al., 2020) and medical imaging. Tables G.4 and G.5 summarize the datasets used in our experiments, listing many with references and/or URLs (e.g., Caltech101: http://www.vision.caltech.edu/Image_Datasets/Caltech101/, CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html). |
| Dataset Splits | Yes | For the downstream evaluation, we mainly focus on the few-shot learning setup (1, 5, 10, and 20 shots) as well as fine-tuning for some of the ablations. In the few-shot setup, a linear classifier is trained on top of the representations from the frozen pre-trained model, given only a fixed number of training examples per class (see the linear-probe sketch after this table). In the fine-tuning setup, we follow the VTAB standard (Zhai et al., 2019) and use 1000 training samples from the downstream task and update all the parameters of the model besides the downstream head. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types, or specific TPU versions) used for running the experiments. It mentions models (Vision Transformers, MLP-Mixers, ResNets) and training details, but no hardware specifics. |
| Software Dependencies | No | All the experiments conducted in this paper are based on the Scenic library (Dehghani et al., 2021b). For the controlled experiments, we train all models using Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999. No version numbers are provided for Scenic or any other software dependency. |
| Experiment Setup | Yes | For the controlled experiments, we train all models using Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999. In all experiments, the batch size is set to 4096. The default weight decay used in the experiments is 0.1, unless a changed value is mentioned in the description of the experiment. For the learning rate, we set the value to 8e-4 (except for large models, where we use 4e-4) and use a linear decay, with a warmup of 1000 steps. We pre-train our models on JFT for 7 epochs. A hedged sketch of this training configuration appears after the table. |
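
The few-shot evaluation quoted in the "Dataset Splits" row trains a linear classifier on representations from the frozen pre-trained model, using 1, 5, 10, or 20 examples per class; the paper does not state which solver is used. The sketch below assumes a closed-form ridge-regression probe on one-hot labels; the function name `fewshot_linear_probe` and the `l2_reg` value are illustrative choices, not taken from the paper.

```python
import numpy as np

def fewshot_linear_probe(train_feats, train_labels, test_feats, test_labels,
                         num_classes, l2_reg=1e-3):
    """Fit a linear classifier on frozen-backbone features via ridge regression.

    train_feats:  [num_classes * shots, dim] features from the frozen model.
    train_labels: [num_classes * shots] integer class labels.
    test_feats:   [num_test, dim] features for the evaluation split.
    test_labels:  [num_test] integer class labels.
    """
    # Regress one-hot targets with L2 regularization (closed-form solution).
    targets = np.eye(num_classes)[train_labels]                # [n, classes]
    dim = train_feats.shape[1]
    gram = train_feats.T @ train_feats + l2_reg * np.eye(dim)
    weights = np.linalg.solve(gram, train_feats.T @ targets)   # [dim, classes]

    # Accuracy of the linear probe on the downstream test set.
    preds = (test_feats @ weights).argmax(axis=1)
    return (preds == test_labels).mean()
```

Only the classifier weights are learned here; the backbone features are computed once and reused, which matches the frozen-representation protocol described above.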
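
The training configuration quoted in the "Experiment Setup" row (Adam with β1 = 0.9, β2 = 0.999, batch size 4096, weight decay 0.1, learning rate 8e-4 with linear decay and a 1000-step warmup) could be assembled roughly as follows. This is a minimal sketch using optax in place of whatever optimizer code Scenic wires up internally; treating the weight decay as decoupled (AdamW-style) and deriving the step budget from 7 epochs over ~303M JFT images are assumptions, not statements from the paper.

```python
import optax

# Assumed step budget: 7 epochs over ~303M JFT images at batch size 4096.
TOTAL_STEPS = (303_000_000 * 7) // 4096   # roughly 517k steps
WARMUP_STEPS = 1_000
BASE_LR = 8e-4                            # the paper uses 4e-4 for its largest models

# Linear warmup to the base learning rate, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, BASE_LR, transition_steps=WARMUP_STEPS),
        optax.linear_schedule(BASE_LR, 0.0,
                              transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with β1 = 0.9, β2 = 0.999; the 0.1 weight decay is modeled here as
# decoupled (AdamW-style) regularization, which is an assumption on our part.
optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.999,
                        weight_decay=0.1)
```

The resulting optimizer would then be handed to the training loop that Scenic builds for the chosen Vision Transformer, MLP-Mixer, or ResNet backbone.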