MLP-Mixer: An all-MLP Architecture for Vision
Authors: Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of MLP-Mixer models, pre-trained with medium- to large-scale datasets, on a range of small and mid-sized downstream classification tasks. We are interested in three primary quantities: (1) Accuracy on the downstream task; (2) Total computational cost of pre-training... (3) Test-time throughput... Our goal is not to demonstrate state-of-the-art results, but to show that, remarkably, a simple MLP-based model is competitive with today's best convolutional and attention-based models. |
| Researcher Affiliation | Industry | Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy. Google Research, Brain Team. {tolstikhin, neilhoulsby, akolesnikov, lbeyer, xzhai, unterthiner, jessicayung, andstein, keysers, usz, lucic, adosovitskiy}@google.com |
| Pseudocode | Yes | Overall, the architecture can be written compactly in JAX/Flax; the code is given in Supplementary F. (A minimal Flax sketch of the Mixer block follows the table.) |
| Open Source Code | Yes | MLP-Mixer code is available at https://github.com/google-research/vision_transformer |
| Open Datasets | Yes | Downstream tasks: We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23], Oxford-IIIT Pets (3.7k examples, 36 classes) [33], and Oxford Flowers-102 (2k examples, 102 classes) [32]. We also use the Visual Task Adaptation Benchmark (VTAB-1k), which consists of 19 diverse datasets, each with 1k training examples [58]. Pre-training: We follow the standard transfer learning setup: pre-training followed by fine-tuning on the downstream tasks. We pre-train our models on two public datasets: ILSVRC2012 ImageNet and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13]. |
| Dataset Splits | Yes | We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23]... We pre-train our models on two public datasets: ILSVRC2012 ImageNet and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13]. |
| Hardware Specification | Yes | For the former we compute two metrics: (1) Total pre-training time on TPU-v3 accelerators... (2) Throughput in images/sec/core on TPU-v3. |
| Software Dependencies | No | The paper mentions JAX/Flax as the framework used for implementing the architecture and refers to the 'timm library' as inspiration, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We pre-train all models at resolution 224 using Adam with β1 = 0.9, β2 = 0.999, linear learning rate warmup of 10k steps and linear decay, batch size 4096, weight decay, and gradient clipping at global norm 1. For JFT-300M, we pre-process images by applying the cropping technique from Szegedy et al. [45] in addition to random horizontal flipping. For ImageNet and ImageNet-21k, we employ additional data augmentation and regularization techniques. In particular, we use RandAugment [12], mixup [60], dropout [43], and stochastic depth [19]. More details on these hyperparameters are provided in Supplementary B. Fine-tuning: We fine-tune using momentum SGD, batch size 512, gradient clipping at global norm 1, and a cosine learning rate schedule with a linear warmup. We do not use weight decay when fine-tuning. (An illustrative Optax sketch of the pre-training optimizer follows the table.) |
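
As noted in the Pseudocode row, the paper states that the architecture can be written compactly in JAX/Flax. The sketch below follows the paper's description (a per-patch linear embedding, then Mixer layers that apply a token-mixing MLP across patches and a channel-mixing MLP across channels, each preceded by LayerNorm and wrapped in a skip connection, followed by global average pooling and a linear head). Class names, module layout, and the configuration values are illustrative, not the authors' exact Supplementary code.

```python
# Minimal Flax sketch of MLP-Mixer, assuming NHWC image input.
import jax.numpy as jnp
import flax.linen as nn


class MlpBlock(nn.Module):
    """Two-layer MLP with GELU, preserving the input's last dimension."""
    hidden_dim: int

    @nn.compact
    def __call__(self, x):
        out_dim = x.shape[-1]
        x = nn.Dense(self.hidden_dim)(x)
        x = nn.gelu(x)
        return nn.Dense(out_dim)(x)


class MixerBlock(nn.Module):
    """One Mixer layer: token mixing, then channel mixing."""
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):              # x: (batch, patches, channels)
        # Token mixing: MLP applied along the patch dimension.
        y = nn.LayerNorm()(x)
        y = jnp.swapaxes(y, 1, 2)       # (batch, channels, patches)
        y = MlpBlock(self.tokens_mlp_dim)(y)
        y = jnp.swapaxes(y, 1, 2)
        x = x + y
        # Channel mixing: MLP applied along the channel dimension.
        y = nn.LayerNorm()(x)
        return x + MlpBlock(self.channels_mlp_dim)(y)


class MlpMixer(nn.Module):
    num_classes: int
    num_blocks: int
    patch_size: int
    hidden_dim: int
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, images):         # images: (batch, H, W, 3)
        # Per-patch linear embedding, implemented as a strided convolution.
        x = nn.Conv(self.hidden_dim,
                    (self.patch_size, self.patch_size),
                    strides=(self.patch_size, self.patch_size))(images)
        x = x.reshape(x.shape[0], -1, x.shape[-1])   # (batch, patches, channels)
        for _ in range(self.num_blocks):
            x = MixerBlock(self.tokens_mlp_dim, self.channels_mlp_dim)(x)
        x = nn.LayerNorm()(x)
        x = jnp.mean(x, axis=1)          # global average pooling over patches
        return nn.Dense(self.num_classes)(x)
```

For instance, `MlpMixer(num_classes=1000, num_blocks=8, patch_size=16, hidden_dim=512, tokens_mlp_dim=256, channels_mlp_dim=2048)` roughly corresponds to the Mixer-S/16 configuration reported in the paper.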
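The pre-training recipe quoted in the Experiment Setup row (Adam with β1 = 0.9, β2 = 0.999, a 10k-step linear warmup followed by linear decay, weight decay, and gradient clipping at global norm 1) can be expressed as an Optax gradient transformation. This is one possible reading, not the authors' code: the peak learning rate, weight-decay coefficient, and total step count are placeholders, and decoupled weight decay (`optax.adamw`) stands in for the paper's "Adam with weight decay".

```python
# Illustrative Optax sketch of the pre-training optimizer; values marked
# "placeholder" are assumptions, not taken from the paper.
import optax

TOTAL_STEPS = 100_000      # placeholder; depends on the pre-training dataset
WARMUP_STEPS = 10_000      # from the paper: 10k-step linear warmup
PEAK_LR = 1e-3             # placeholder peak learning rate
WEIGHT_DECAY = 0.1         # placeholder weight-decay coefficient

# Linear warmup to the peak learning rate, then linear decay to zero.
schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, PEAK_LR, WARMUP_STEPS),
        optax.linear_schedule(PEAK_LR, 0.0, TOTAL_STEPS - WARMUP_STEPS),
    ],
    boundaries=[WARMUP_STEPS],
)

optimizer = optax.chain(
    optax.clip_by_global_norm(1.0),   # gradient clipping at global norm 1
    optax.adamw(schedule, b1=0.9, b2=0.999, weight_decay=WEIGHT_DECAY),
)
```

The fine-tuning setup (momentum SGD, batch size 512, cosine schedule with linear warmup, no weight decay) would use a different chain; it is omitted here since the quoted text already specifies it fully.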