Scaling Vision Transformers to 22 Billion Parameters
Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen, Neil Houlsby
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. |
| Researcher Affiliation | Industry | Mostafa Dehghani * Josip Djolonga * Basil Mustafa * Piotr Padlewski * Jonathan Heek * Justin Gilmer Andreas Steiner Mathilde Caron Robert Geirhos Ibrahim Alabdulmohsin Rodolphe Jenatton Lucas Beyer Michael Tschannen Anurag Arnab Xiao Wang Carlos Riquelme Matthias Minderer Joan Puigcerver Utku Evci Manoj Kumar Sjoerd van Steenkiste Gamaleldin F. Elsayed Aravindh Mahendran Fisher Yu Avital Oliver Fantine Huot Jasmijn Bastings Mark Patrick Collier Alexey A. Gritsenko Vighnesh Birodkar Cristina Vasconcelos Yi Tay Thomas Mensink Alexander Kolesnikov Filip Pavetić Dustin Tran Thomas Kipf Mario Lučić Xiaohua Zhai Daniel Keysers Jeremiah Harmsen Neil Houlsby * Google Research |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 3) but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions implementing ViT-22B in JAX, FLAX, and Scenic and provides citations to these libraries, but it does not state that the code for ViT-22B itself is open-source or provide a link to its own repository. |
| Open Datasets | Yes | ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al., 2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels. |
| Dataset Splits | Yes | We explored various ways of training a linear probe, our final setup on ImageNet uses SGD with momentum for 10 epochs at 224px resolution, with mild random cropping and horizontal flipping as the only data augmentations, and no further regularizations. ... Specifically, we precompute image embeddings by resizing input images to 224px resolution and then solve the multiclass logistic regression problem with L-BFGS. We also sweep the L2 regularization parameter and select the optimal one using 20000 holdout images from the training data (approximately 2% of the training data). (See the linear-probe sketch after this table.) |
| Hardware Specification | Yes | ViT-22B processes 1.15k tokens per second per core during training (forward and backward pass) on TPUv4 (Jouppi et al., 2020). ... Training Hardware: TPUv4 (Jouppi et al., 2020). ... ViT-22B was trained on 1024 TPUv4 chips for 177K steps. |
| Software Dependencies | No | ViT-22B is implemented in JAX (Bradbury et al., 2018) using the FLAX library (Heek et al., 2020) and built within Scenic (Dehghani et al., 2022). While the software names are mentioned with their publication years, explicit version numbers for reproducibility (e.g., JAX 0.x.x, FLAX 0.x.x) are not provided. |
| Experiment Setup | Yes | ViT-22B was trained using 256 visual tokens per image, where each token represents a 14×14 patch extracted from 224×224 sized images. ViT-22B is trained for 177k steps with a batch size of 65k: approximately 3 epochs. We use a reciprocal square-root learning rate schedule with a peak of 10⁻³, and linear warmup (first 10k steps) and cooldown (last 30k steps) phases. For better few-shot adaptation, we use a higher weight decay on the head (3.0) than body (0.03) for upstream training (Zhai et al., 2022a; Abnar et al., 2021). (A learning-rate schedule sketch follows this table.) |
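
The linear-probe protocol quoted in the Dataset Splits row can be summarized with a short sketch: fit a multiclass logistic regression with L-BFGS on precomputed, frozen embeddings, sweep the L2 strength, and pick the best value on a ~20k-image holdout. This is only an illustration; the variable names, the sweep grid, and the use of scikit-learn are assumptions, not the authors' code.

```python
# Minimal sketch of the frozen-feature linear probe described in the table.
# Assumes `train_embeddings` (N x D float array) and `train_labels` (N ints)
# are ViT-22B embeddings precomputed from images resized to 224px.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_probe(train_embeddings, train_labels, holdout_size=20_000):
    # Hold out ~2% of the training data to select the L2 regularization strength.
    idx = np.random.permutation(len(train_embeddings))
    val_idx, fit_idx = idx[:holdout_size], idx[holdout_size:]

    best_acc, best_clf = -1.0, None
    # Hypothetical sweep grid; C is the inverse of the L2 regularization strength.
    for c in [0.001, 0.01, 0.1, 1.0, 10.0]:
        clf = LogisticRegression(C=c, solver="lbfgs", max_iter=1000)
        clf.fit(train_embeddings[fit_idx], train_labels[fit_idx])
        acc = clf.score(train_embeddings[val_idx], train_labels[val_idx])
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    return best_clf
```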
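The Experiment Setup row's schedule (reciprocal square-root with a 10⁻³ peak, 10k-step linear warmup, and 30k-step linear cooldown over 177k total steps) can likewise be sketched. Note that the 256 tokens per image follow from a 16×16 grid of 14×14 patches (224/14 = 16). The functional form below is an assumption based on common rsqrt schedules, not the authors' implementation.

```python
# Hedged sketch of a reciprocal square-root learning-rate schedule with linear
# warmup and cooldown, using the numbers quoted above (assumed form, not the
# authors' code).
import math

def rsqrt_schedule(step, total_steps=177_000, peak_lr=1e-3,
                   warmup_steps=10_000, cooldown_steps=30_000):
    # Linear warmup from 0 to the peak learning rate over the first 10k steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Reciprocal square-root decay after warmup (equals peak_lr at warmup end).
    def rsqrt(s):
        return peak_lr * math.sqrt(warmup_steps / s)

    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return rsqrt(step)

    # Linear cooldown from the rsqrt value at cooldown start down to 0.
    frac_left = (total_steps - step) / cooldown_steps
    return rsqrt(cooldown_start) * frac_left
```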