Scalable Pre-training of Large Autoregressive Image Models

Authors: Alaaeldin El-Nouby, Michal Klein, Shuangfei Zhai, Miguel Ángel Bautista, Vaishaal Shankar, Alexander T Toshev, Joshua M. Susskind, Armand Joulin

ICML 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | This paper introduces AIM, a collection of vision models pre-trained with an autoregressive objective. These models are inspired by their textual counterparts, i.e., Large Language Models (LLMs), and exhibit similar scaling properties. Specifically, we highlight two key findings: (1) the performance of the visual features scale with both the model capacity and the quantity of data, (2) the value of the objective function correlates with the performance of the model on downstream tasks. We illustrate the practical implication of these findings by pre-training a 7 billion parameter AIM on 2 billion images, that achieves 84.0% on ImageNet-1k with a frozen trunk. ... We provide a study of a series of models, ranging from 600M to 7B parameters pre-trained using 2B uncurated images with permissive licenses. AIM exhibits strong scaling behavior w.r.t. the model size as shown in Figure 1 where higher capacity models achieve better downstream performance, measured as the average accuracy over 15 image recognition benchmarks. ... In Figure 4, we measure for each model the value of the pre-training loss and the classification accuracy on the validation set, as a function of the number of training iterations. ... In Table 6, we compare the attentive probing performance of AIM to other state-of-the-art methods across a set of 15 diverse benchmarks... (An illustrative sketch of the autoregressive objective appears after the table.)
Researcher Affiliation | Industry | Alaaeldin El-Nouby¹, Michal Klein¹, Shuangfei Zhai¹, Miguel Angel Bautista¹, Vaishaal Shankar¹, Alexander Toshev¹, Joshua M Susskind¹, Armand Joulin¹. ¹Apple. Work done while with Apple. Now at Google DeepMind. Correspondence to: <alaaeldin ali@apple.com>.
Pseudocode | No | The paper describes its approach and architecture through textual descriptions and diagrams (e.g., Figure 2, Figure 3), but it does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code | Yes | https://github.com/apple/ml-aim
Open Datasets | Yes | We pre-train our models on the DFN dataset introduced by Fang et al. [2023]. This dataset is composed of a larger collection of 12.8B image-text pairs [Gadre et al., 2023] filtered from Common Crawl. ... We sample images from DFN-2B with a probability of p = 0.8 and sample images from ImageNet-1k with a probability of p = 0.2. (A sketch of this sampling mixture follows the table.)
Dataset Splits | Yes | "For all of these experiments, we report the value of our loss function on the validation set of IN-1k." and "Table 9: Evaluation benchmarks. We provide the references, the number of images in the train and test sets, and the number of categories of all the 15 recognition benchmarks used in this work." For example: ImageNet-1k [Deng et al., 2009]: 1,281,167 train / 50,000 test / 1,000 classes; CIFAR-10 [Krizhevsky et al., 2009]: 50,000 train / 10,000 test / 10 classes.
Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or cloud instance specifications used for running the experiments. It only mentions general training aspects like 'bfloat16 precision' and batch sizes, without specifying the underlying compute infrastructure.
Software Dependencies | No | The paper describes the architectures and models used, such as the Vision Transformer and references to LLMs, but it does not specify any software dependencies or libraries with their version numbers (e.g., Python version, PyTorch version, TensorFlow version) that would be necessary for replication.
Experiment Setup | Yes | Table 1: Model specifications. We provide the embedding dimension, number of layers, and parameter count for all AIM variants. We also provide the learning rate and batch size during pre-training. ... Table 10: Pre-training hyperparameters. All AIM variants of different capacities have been trained using the same set of hyperparameters detailed above. Optimizer: AdamW; optimizer momentum: β1 = 0.9, β2 = 0.95; peak learning rate: 1e-3; minimum learning rate: 0.0; weight decay: 0.05; batch size: 4096; patch size: (14, 14); gradient clipping: 1.0; warmup iterations: 31,250; total iterations: 1,250,000; learning rate schedule: cosine decay. Augmentations: RandomResizedCrop (size 224px, scale [0.4, 1.0], ratio [0.75, 1.33], bicubic interpolation) and RandomHorizontalFlip (p = 0.5). (A sketch of this recipe in code appears after the table.)
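
Illustrative sketch (autoregressive objective). The Research Type row quotes the paper's core idea: pre-train a vision model with an autoregressive objective, mirroring next-token prediction in LLMs. The following is a minimal PyTorch sketch, not the authors' implementation: the class name ToyAIM, all dimensions, and the plain raster-order causal mask with a raw-pixel MSE loss are illustrative assumptions (details from the paper such as patch normalization and prefix attention are omitted).

    import torch
    import torch.nn as nn

    class ToyAIM(nn.Module):
        """Toy autoregressive image model: predict each patch from the earlier patches."""

        def __init__(self, patch_dim=14 * 14 * 3, dim=256, depth=4, heads=8, num_patches=256):
            super().__init__()
            self.embed = nn.Linear(patch_dim, dim)          # patch pixels -> tokens
            self.pos = nn.Parameter(torch.zeros(1, num_patches, dim))
            layer = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
            self.trunk = nn.TransformerEncoder(layer, depth)
            self.head = nn.Linear(dim, patch_dim)           # regress pixels of the next patch

        def forward(self, patches):                         # patches: (B, N, patch_dim)
            N = patches.shape[1]
            x = self.embed(patches) + self.pos[:, :N]
            # Causal mask: position i may only attend to positions <= i (raster order).
            causal = torch.triu(torch.full((N, N), float("-inf"), device=x.device), diagonal=1)
            h = self.trunk(x, mask=causal)
            pred = self.head(h[:, :-1])                     # predictions for patches 2..N
            target = patches[:, 1:]
            return ((pred - target) ** 2).mean()            # pixel regression loss

    # Usage with random data:
    # loss = ToyAIM()(torch.randn(2, 256, 14 * 14 * 3)); loss.backward()

Each patch is predicted from the patches that precede it, which is the property the paper exploits to obtain LLM-like scaling behavior.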
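Illustrative sketch (data mixture). The Open Datasets row quotes a sampling mixture of DFN-2B (p = 0.8) and ImageNet-1k (p = 0.2). A minimal, hypothetical way to express that mixture, assuming generic sequence-like dataset objects rather than the authors' data pipeline:

    import random

    def sample_batch(dfn2b, imagenet1k, batch_size=4096, p_dfn=0.8, rng=random):
        """Draw a batch, picking each image's source dataset independently."""
        batch = []
        for _ in range(batch_size):
            # With probability 0.8 take an image from DFN-2B, otherwise from ImageNet-1k.
            source = dfn2b if rng.random() < p_dfn else imagenet1k
            batch.append(source[rng.randrange(len(source))])
        return batch

    # Usage with toy stand-ins for the real datasets:
    # batch = sample_batch(list(range(1000)), list(range(100)), batch_size=8)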
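Illustrative sketch (pre-training recipe). The hyperparameters in the Experiment Setup row map onto a standard PyTorch optimizer, schedule, and augmentation configuration. The wiring below is a hedged sketch, not the released training code; `model` is a stand-in and the linear-warmup-then-cosine composition is one plausible reading of Table 10.

    import torch
    from torchvision import transforms

    model = torch.nn.Linear(8, 8)  # stand-in for an AIM trunk; illustrative only

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3,
                                  betas=(0.9, 0.95), weight_decay=0.05)

    # Linear warmup for 31,250 iterations, then cosine decay to 0 over the
    # remaining iterations (1,250,000 total), following the values in Table 10.
    warmup = torch.optim.lr_scheduler.LinearLR(optimizer, start_factor=1e-6,
                                               total_iters=31_250)
    cosine = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer,
                                                        T_max=1_250_000 - 31_250,
                                                        eta_min=0.0)
    schedule = torch.optim.lr_scheduler.SequentialLR(optimizer, [warmup, cosine],
                                                     milestones=[31_250])

    # Reported augmentations: random resized crop to 224px and horizontal flip.
    augment = transforms.Compose([
        transforms.RandomResizedCrop(224, scale=(0.4, 1.0), ratio=(0.75, 1.33),
                                     interpolation=transforms.InterpolationMode.BICUBIC),
        transforms.RandomHorizontalFlip(p=0.5),
    ])

    # Gradient clipping at 1.0 would be applied each step before optimizer.step():
    # torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)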