Scaling Vision Transformers to 22 Billion Parameters
Authors: Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, Rodolphe Jenatton, Lucas Beyer, Michael Tschannen, Anurag Arnab, Xiao Wang, Carlos Riquelme Ruiz, Matthias Minderer, Joan Puigcerver, Utku Evci, Manoj Kumar, Sjoerd van Steenkiste, Gamaleldin Fathy Elsayed, Aravindh Mahendran, Fisher Yu, Avital Oliver, Fantine Huot, Jasmijn Bastings, Mark Collier, Alexey A. Gritsenko, Vighnesh Birodkar, Cristina Nader Vasconcelos, Yi Tay, Thomas Mensink, Alexander Kolesnikov, Filip Pavetić, Dustin Tran, Thomas Kipf, Mario Lučić, Xiaohua Zhai, Daniel Keysers, Jeremiah J. Harmsen, Neil Houlsby
ICML 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B demonstrates increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment to human visual perception in terms of shape/texture bias, and improved robustness. |
| Researcher Affiliation | Industry | Mostafa Dehghani * Josip Djolonga * Basil Mustafa * Piotr Padlewski * Jonathan Heek * Justin Gilmer Andreas Steiner Mathilde Caron Robert Geirhos Ibrahim Alabdulmohsin Rodolphe Jenatton Lucas Beyer Michael Tschannen Anurag Arnab Xiao Wang Carlos Riquelme Matthias Minderer Joan Puigcerver Utku Evci Manoj Kumar Sjoerd van Steenkiste Gamaleldin F. Elsayed Aravindh Mahendran Fisher Yu Avital Oliver Fantine Huot Jasmijn Bastings Mark Patrick Collier Alexey A. Gritsenko Vighnesh Birodkar Cristina Vasconcelos Yi Tay Thomas Mensink Alexander Kolesnikov Filip Pavetić Dustin Tran Thomas Kipf Mario Lučić Xiaohua Zhai Daniel Keysers Jeremiah Harmsen Neil Houlsby * Google Research |
| Pseudocode | No | The paper includes architectural diagrams (Figure 2, Figure 3) but no explicit pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions implementing ViT-22B in JAX, FLAX, and Scenic and provides citations to these libraries, but it does not state that the code for ViT-22B itself is open-source or provide a link to its own repository. |
| Open Datasets | Yes | ViT-22B is trained on a version of JFT (Sun et al., 2017), extended to around 4B images (Zhai et al., 2022a). These images have been semi-automatically annotated with a class-hierarchy of 30k labels. |
| Dataset Splits | Yes | We explored various ways of training a linear probe, our final setup on ImageNet uses SGD with momentum for 10 epochs at 224px resolution, with mild random cropping and horizontal flipping as the only data augmentations, and no further regularizations. ... Specifically, we precompute image embeddings by resizing input images to 224px resolution and then solve the multiclass logistic regression problem with L-BFGS. We also sweep the L2 regularization parameter and select the optimal one using 20000 holdout images from the training data (approximately 2% of the training data). (See the linear-probe sketch after this table.) |
| Hardware Specification | Yes | ViT-22B processes 1.15k tokens per second per core during training (forward and backward pass) on TPUv4 (Jouppi et al., 2020). ... Training Hardware: TPUv4 (Jouppi et al., 2020). ... ViT-22B was trained on 1024 TPUv4 chips for 177K steps. |
| Software Dependencies | No | ViT-22B is implemented in JAX (Bradbury et al., 2018) using the FLAX library (Heek et al., 2020) and built within Scenic (Dehghani et al., 2022). While the software names are mentioned with their publication years, explicit version numbers for reproducibility (e.g., JAX 0.x.x, FLAX 0.x.x) are not provided. |
| Experiment Setup | Yes | ViT-22B was trained using 256 visual tokens per image, where each token represents a 14×14 patch extracted from 224×224 sized images. ViT-22B is trained for 177k steps with a batch size of 65k: approximately 3 epochs. We use a reciprocal square-root learning rate schedule with a peak of 10⁻³, and linear warmup (first 10k steps) and cooldown (last 30k steps) phases. For better few-shot adaptation, we use a higher weight decay on the head (3.0) than body (0.03) for upstream training (Zhai et al., 2022a; Abnar et al., 2021). (A learning-rate schedule sketch follows this table.) |
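
The linear-probe protocol quoted in the Dataset Splits row can be summarized with a short sketch: fit a multiclass logistic regression with L-BFGS on precomputed, frozen embeddings, sweep the L2 strength, and pick the best value on a ~20k-image holdout. This is only an illustration; the variable names, the sweep grid, and the use of scikit-learn are assumptions, not the authors' code.

```python
# Minimal sketch of the frozen-feature linear probe described in the table.
# Assumes `train_embeddings` (N x D float array) and `train_labels` (N ints)
# are ViT-22B embeddings precomputed from images resized to 224px.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_linear_probe(train_embeddings, train_labels, holdout_size=20_000):
    # Hold out ~2% of the training data to select the L2 regularization strength.
    idx = np.random.permutation(len(train_embeddings))
    val_idx, fit_idx = idx[:holdout_size], idx[holdout_size:]

    best_acc, best_clf = -1.0, None
    # Hypothetical sweep grid; C is the inverse of the L2 regularization strength.
    for c in [0.001, 0.01, 0.1, 1.0, 10.0]:
        clf = LogisticRegression(C=c, solver="lbfgs", max_iter=1000)
        clf.fit(train_embeddings[fit_idx], train_labels[fit_idx])
        acc = clf.score(train_embeddings[val_idx], train_labels[val_idx])
        if acc > best_acc:
            best_acc, best_clf = acc, clf
    return best_clf
```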
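The Experiment Setup row's schedule (reciprocal square-root with a 10⁻³ peak, 10k-step linear warmup, and 30k-step linear cooldown over 177k total steps) can likewise be sketched. Note that the 256 tokens per image follow from a 16×16 grid of 14×14 patches (224/14 = 16). The functional form below is an assumption based on common rsqrt schedules, not the authors' implementation.

```python
# Hedged sketch of a reciprocal square-root learning-rate schedule with linear
# warmup and cooldown, using the numbers quoted above (assumed form, not the
# authors' code).
import math

def rsqrt_schedule(step, total_steps=177_000, peak_lr=1e-3,
                   warmup_steps=10_000, cooldown_steps=30_000):
    # Linear warmup from 0 to the peak learning rate over the first 10k steps.
    if step < warmup_steps:
        return peak_lr * step / warmup_steps

    # Reciprocal square-root decay after warmup (equals peak_lr at warmup end).
    def rsqrt(s):
        return peak_lr * math.sqrt(warmup_steps / s)

    cooldown_start = total_steps - cooldown_steps
    if step < cooldown_start:
        return rsqrt(step)

    # Linear cooldown from the rsqrt value at cooldown start down to 0.
    frac_left = (total_steps - step) / cooldown_steps
    return rsqrt(cooldown_start) * frac_left
```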