MLP-Mixer: An all-MLP Architecture for Vision

Authors: Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of MLP-Mixer models, pre-trained with medium- to large-scale datasets, on a range of small and mid-sized downstream classification tasks. We are interested in three primary quantities: (1) Accuracy on the downstream task; (2) Total computational cost of pre-training... (3) Test-time throughput... Our goal is not to demonstrate state-of-the-art results, but to show that, remarkably, a simple MLP-based model is competitive with today's best convolutional and attention-based models.
Researcher Affiliation | Industry | Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy. Google Research, Brain Team. {tolstikhin, neilhoulsby, akolesnikov, lbeyer, xzhai, unterthiner, jessicayung, andstein, keysers, usz, lucic, adosovitskiy}@google.com
Pseudocode | Yes | Overall, the architecture can be written compactly in JAX/Flax; the code is given in Supplementary F. (A Flax sketch of the architecture is included after this table.)
Open Source Code | Yes | MLP-Mixer code is available at https://github.com/google-research/vision_transformer
Open Datasets | Yes | Downstream tasks: We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23], Oxford-IIIT Pets (3.7k examples, 36 classes) [33], and Oxford Flowers-102 (2k examples, 102 classes) [32]. We also use the Visual Task Adaptation Benchmark (VTAB-1k), which consists of 19 diverse datasets, each with 1k training examples [58]. Pre-training: We follow the standard transfer learning setup: pre-training followed by fine-tuning on the downstream tasks. We pre-train our models on two public datasets: ILSVRC2012 ImageNet, and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13].
Dataset Splits | Yes | We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23]... We pre-train our models on two public datasets: ILSVRC2012 ImageNet, and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13].
Hardware Specification | Yes | For the former we compute two metrics: (1) Total pre-training time on TPU-v3 accelerators... (2) Throughput in images/sec/core on TPU-v3.
Software Dependencies | No | The paper mentions JAX/Flax as the framework used for implementing the architecture and refers to the 'timm library' as inspiration, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We pre-train all models at resolution 224 using Adam with β1 = 0.9, β2 = 0.999, linear learning rate warmup of 10k steps and linear decay, batch size 4096, weight decay, and gradient clipping at global norm 1. For JFT-300M, we pre-process images by applying the cropping technique from Szegedy et al. [45] in addition to random horizontal flipping. For ImageNet and ImageNet-21k, we employ additional data augmentation and regularization techniques. In particular, we use RandAugment [12], mixup [60], dropout [43], and stochastic depth [19]. More details on these hyperparameters are provided in Supplementary B. Fine-tuning: We fine-tune using momentum SGD, batch size 512, gradient clipping at global norm 1, and a cosine learning rate schedule with a linear warmup. We do not use weight decay when fine-tuning. (An optax sketch of these optimizer settings is included below, after the architecture sketch.)
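As referenced in the Pseudocode row above, the Mixer architecture is compact enough to write out in Flax. The sketch below paraphrases the style of the paper's Supplementary F listing; the layer names and class layout here are illustrative assumptions rather than the released code.

```python
import einops
import flax.linen as nn
import jax.numpy as jnp


class MlpBlock(nn.Module):
    """Two-layer MLP with GELU, applied along the last axis."""
    mlp_dim: int

    @nn.compact
    def __call__(self, x):
        y = nn.Dense(self.mlp_dim)(x)
        y = nn.gelu(y)
        return nn.Dense(x.shape[-1])(y)


class MixerBlock(nn.Module):
    """Token-mixing MLP followed by channel-mixing MLP, each with a skip connection."""
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):
        # Token mixing: LayerNorm, then transpose so the MLP acts across patches.
        y = nn.LayerNorm()(x)
        y = jnp.swapaxes(y, 1, 2)
        y = MlpBlock(self.tokens_mlp_dim, name="token_mixing")(y)
        y = jnp.swapaxes(y, 1, 2)
        x = x + y
        # Channel mixing: LayerNorm, then MLP across the hidden (channel) dimension.
        y = nn.LayerNorm()(x)
        return x + MlpBlock(self.channels_mlp_dim, name="channel_mixing")(y)


class MlpMixer(nn.Module):
    """Patch stem + stack of Mixer blocks + global average pooling + linear head."""
    num_classes: int
    num_blocks: int
    patch_size: int
    hidden_dim: int
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):
        s = self.patch_size
        # Per-patch linear embedding, implemented as a strided convolution.
        x = nn.Conv(self.hidden_dim, (s, s), strides=(s, s), name="stem")(x)
        x = einops.rearrange(x, "n h w c -> n (h w) c")
        for _ in range(self.num_blocks):
            x = MixerBlock(self.tokens_mlp_dim, self.channels_mlp_dim)(x)
        x = nn.LayerNorm(name="pre_head_layer_norm")(x)
        x = jnp.mean(x, axis=1)  # global average pooling over patches
        return nn.Dense(self.num_classes, name="head",
                        kernel_init=nn.initializers.zeros)(x)
```

Under these assumptions, a Mixer-B/16-sized model (per the paper's Table 1 specifications) would be instantiated as MlpMixer(num_classes=1000, num_blocks=12, patch_size=16, hidden_dim=768, tokens_mlp_dim=384, channels_mlp_dim=3072).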
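The optimizer settings quoted in the Experiment Setup row can be approximated with optax as below. Only the warmup/decay shapes, Adam betas, and global-norm clipping come from the quoted text; the peak learning rates, step counts, and weight-decay value are placeholders (the paper's per-dataset values are in its Supplementary B).

```python
import optax

# ---- Pre-training: Adam(b1=0.9, b2=0.999), 10k-step linear warmup, linear decay,
# ---- weight decay, gradient clipping at global norm 1.
total_steps = 100_000   # placeholder; per-dataset step counts are in Supplementary B
warmup_steps = 10_000   # from the quoted setup
peak_lr = 1e-3          # placeholder peak learning rate

pretrain_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

pretrain_tx = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(pretrain_schedule, b1=0.9, b2=0.999,
                weight_decay=0.1),  # weight-decay value is a placeholder
)

# ---- Fine-tuning: momentum SGD, cosine schedule with linear warmup, no weight decay.
finetune_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=3e-2,        # placeholder peak value
    warmup_steps=500, decay_steps=20_000,   # placeholder step counts
)

finetune_tx = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.sgd(finetune_schedule, momentum=0.9),
)
```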