Exploring the Limits of Large Scale Pre-training
Authors: Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, Hanie Sedghi
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work we systematically study this phenomenon and establish that, as we increase the upstream accuracy, the performance of downstream tasks saturates. In particular, we investigate more than 4800 experiments on Vision Transformers, MLP-Mixers and ResNets with the number of parameters ranging from ten million to ten billion, trained on the largest scale of available image data (JFT, ImageNet21K) and evaluated on more than 20 downstream image recognition tasks. |
| Researcher Affiliation | Industry | Samira Abnar, Mostafa Dehghani, Behnam Neyshabur, and Hanie Sedghi; Google Research; {samiraabnar,dehghani,neyshabur,hsedghi}@google.com |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | All the experiments conducted in this paper are based on Scenic library (Dehghani et al., 2021b). We have shared details on the customization done in the controlled experiments in Appendix 1.1. No explicit statement about the authors' specific code for this paper being open-source or a link to it is provided. |
| Open Datasets | Yes | The models are pre-trained on either JFT (Sun et al., 2017) with 303M images and 18K classes or ImageNet21K (Deng et al., 2009) with 14M images and 21K classes and evaluated on a variety of downstream datasets for few-shot and transfer learning settings. Our 25 downstream tasks cover a wide range of standard datasets that are included in benchmarks like VTAB (Zhai et al., 2019), Meta-Dataset (Triantafillou et al., 2019), Wilds (Koh et al., 2020) and medical imaging. Tables G.4 and G.5 summarize the datasets used in our experiments, listing many with references and/or URLs (e.g., Caltech101: http://www.vision.caltech.edu/Image_Datasets/Caltech101/, CIFAR-10: https://www.cs.toronto.edu/~kriz/cifar.html). |
| Dataset Splits | Yes | For the downstream evaluation, we mainly focus on the few-shot learning setup (1, 5, 10, and 20 shots) as well as fine-tuning for some of the ablations. In the few-shot setup, a linear classifier is trained on top of the representations from the frozen pre-trained model, given only a fixed number of training examples per class (see the linear-probe sketch after this table). In the fine-tuning setup, we follow the VTAB standard (Zhai et al., 2019) and use 1000 training samples from the downstream task and update all the parameters of the model besides the downstream head. |
| Hardware Specification | No | The paper does not specify the exact hardware (e.g., GPU models, CPU types, or specific TPU versions) used for running the experiments. It mentions models (Vision Transformers, MLP-Mixers, ResNets) and training details, but no hardware specifics. |
| Software Dependencies | No | All the experiments conducted in this paper are based on the Scenic library (Dehghani et al., 2021b). For the controlled experiments, we train all models using Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999. No version numbers are provided for Scenic or any other software dependency. |
| Experiment Setup | Yes | For the controlled experiments, we train all models using Adam (Kingma & Ba, 2014) with β1 = 0.9, β2 = 0.999. In all experiments, the batch size is set to 4096. The default weight decay used in the experiments is 0.1, unless a changed value is mentioned in the description of the experiment. For the learning rate, we set the value to 8e-4 (except for large models, where we use 4e-4) and use a linear decay, with a warmup of 1000 steps. We pre-train our models on JFT for 7 epochs. A hedged sketch of this training configuration appears after the table. |
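
The few-shot evaluation quoted in the "Dataset Splits" row trains a linear classifier on representations from the frozen pre-trained model, using 1, 5, 10, or 20 examples per class; the paper does not state which solver is used. The sketch below assumes a closed-form ridge-regression probe on one-hot labels; the function name `fewshot_linear_probe` and the `l2_reg` value are illustrative choices, not taken from the paper.

```python
import numpy as np

def fewshot_linear_probe(train_feats, train_labels, test_feats, test_labels,
                         num_classes, l2_reg=1e-3):
    """Fit a linear classifier on frozen-backbone features via ridge regression.

    train_feats:  [num_classes * shots, dim] features from the frozen model.
    train_labels: [num_classes * shots] integer class labels.
    test_feats:   [num_test, dim] features for the evaluation split.
    test_labels:  [num_test] integer class labels.
    """
    # Regress one-hot targets with L2 regularization (closed-form solution).
    targets = np.eye(num_classes)[train_labels]                # [n, classes]
    dim = train_feats.shape[1]
    gram = train_feats.T @ train_feats + l2_reg * np.eye(dim)
    weights = np.linalg.solve(gram, train_feats.T @ targets)   # [dim, classes]

    # Accuracy of the linear probe on the downstream test set.
    preds = (test_feats @ weights).argmax(axis=1)
    return (preds == test_labels).mean()
```

Only the classifier weights are learned here; the backbone features are computed once and reused, which matches the frozen-representation protocol described above.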
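
The training configuration quoted in the "Experiment Setup" row (Adam with β1 = 0.9, β2 = 0.999, batch size 4096, weight decay 0.1, learning rate 8e-4 with linear decay and a 1000-step warmup) could be assembled roughly as follows. This is a minimal sketch using optax in place of whatever optimizer code Scenic wires up internally; treating the weight decay as decoupled (AdamW-style) and deriving the step budget from 7 epochs over ~303M JFT images are assumptions, not statements from the paper.

```python
import optax

# Assumed step budget: 7 epochs over ~303M JFT images at batch size 4096.
TOTAL_STEPS = (303_000_000 * 7) // 4096   # roughly 517k steps
WARMUP_STEPS = 1_000
BASE_LR = 8e-4                            # the paper uses 4e-4 for its largest models

# Linear warmup to the base learning rate, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, BASE_LR, transition_steps=WARMUP_STEPS),
        optax.linear_schedule(BASE_LR, 0.0,
                              transition_steps=TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

# Adam with β1 = 0.9, β2 = 0.999; the 0.1 weight decay is modeled here as
# decoupled (AdamW-style) regularization, which is an assumption on our part.
optimizer = optax.adamw(learning_rate=schedule, b1=0.9, b2=0.999,
                        weight_decay=0.1)
```

The resulting optimizer would then be handed to the training loop that Scenic builds for the chosen Vision Transformer, MLP-Mixer, or ResNet backbone.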