MLP-Mixer: An all-MLP Architecture for Vision

Authors: Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the performance of MLP-Mixer models, pre-trained with medium- to large-scale datasets, on a range of small and mid-sized downstream classification tasks. We are interested in three primary quantities: (1) Accuracy on the downstream task; (2) Total computational cost of pre-training... (3) Test-time throughput... Our goal is not to demonstrate state-of-the-art results, but to show that, remarkably, a simple MLP-based model is competitive with today's best convolutional and attention-based models.
Researcher Affiliation | Industry | Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy. Google Research, Brain Team. {tolstikhin, neilhoulsby, akolesnikov, lbeyer, xzhai, unterthiner, jessicayung, andstein, keysers, usz, lucic, adosovitskiy}@google.com
Pseudocode | Yes | Overall, the architecture can be written compactly in JAX/Flax; the code is given in Supplementary F. (A Flax sketch of the architecture is included after this table.)
Open Source Code | Yes | MLP-Mixer code is available at https://github.com/google-research/vision_transformer
Open Datasets | Yes | Downstream tasks: We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23], Oxford-IIIT Pets (3.7k examples, 36 classes) [33], and Oxford Flowers-102 (2k examples, 102 classes) [32]. We also use the Visual Task Adaptation Benchmark (VTAB-1k), which consists of 19 diverse datasets, each with 1k training examples [58]. Pre-training: We follow the standard transfer learning setup: pre-training followed by fine-tuning on the downstream tasks. We pre-train our models on two public datasets: ILSVRC2012 ImageNet, and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13].
Dataset Splits | Yes | We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23]... We pre-train our models on two public datasets: ILSVRC2012 ImageNet, and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13].
Hardware Specification | Yes | For the former we compute two metrics: (1) Total pre-training time on TPU-v3 accelerators... (2) Throughput in images/sec/core on TPU-v3.
Software Dependencies | No | The paper mentions JAX/Flax as the framework used for implementing the architecture and refers to the 'timm library' as inspiration, but does not provide specific version numbers for any software dependencies.
Experiment Setup | Yes | We pre-train all models at resolution 224 using Adam with β1 = 0.9, β2 = 0.999, linear learning rate warmup of 10k steps and linear decay, batch size 4096, weight decay, and gradient clipping at global norm 1. For JFT-300M, we pre-process images by applying the cropping technique from Szegedy et al. [45] in addition to random horizontal flipping. For ImageNet and ImageNet-21k, we employ additional data augmentation and regularization techniques. In particular, we use RandAugment [12], mixup [60], dropout [43], and stochastic depth [19]. More details on these hyperparameters are provided in Supplementary B. Fine-tuning: We fine-tune using momentum SGD, batch size 512, gradient clipping at global norm 1, and a cosine learning rate schedule with a linear warmup. We do not use weight decay when fine-tuning. (An optax sketch of these optimizer settings is included below, after the architecture sketch.)
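As referenced in the Pseudocode row above, the Mixer architecture is compact enough to write out in Flax. The sketch below paraphrases the style of the paper's Supplementary F listing; the layer names and class layout here are illustrative assumptions rather than the released code.

```python
import einops
import flax.linen as nn
import jax.numpy as jnp


class MlpBlock(nn.Module):
    """Two-layer MLP with GELU, applied along the last axis."""
    mlp_dim: int

    @nn.compact
    def __call__(self, x):
        y = nn.Dense(self.mlp_dim)(x)
        y = nn.gelu(y)
        return nn.Dense(x.shape[-1])(y)


class MixerBlock(nn.Module):
    """Token-mixing MLP followed by channel-mixing MLP, each with a skip connection."""
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):
        # Token mixing: LayerNorm, then transpose so the MLP acts across patches.
        y = nn.LayerNorm()(x)
        y = jnp.swapaxes(y, 1, 2)
        y = MlpBlock(self.tokens_mlp_dim, name="token_mixing")(y)
        y = jnp.swapaxes(y, 1, 2)
        x = x + y
        # Channel mixing: LayerNorm, then MLP across the hidden (channel) dimension.
        y = nn.LayerNorm()(x)
        return x + MlpBlock(self.channels_mlp_dim, name="channel_mixing")(y)


class MlpMixer(nn.Module):
    """Patch stem + stack of Mixer blocks + global average pooling + linear head."""
    num_classes: int
    num_blocks: int
    patch_size: int
    hidden_dim: int
    tokens_mlp_dim: int
    channels_mlp_dim: int

    @nn.compact
    def __call__(self, x):
        s = self.patch_size
        # Per-patch linear embedding, implemented as a strided convolution.
        x = nn.Conv(self.hidden_dim, (s, s), strides=(s, s), name="stem")(x)
        x = einops.rearrange(x, "n h w c -> n (h w) c")
        for _ in range(self.num_blocks):
            x = MixerBlock(self.tokens_mlp_dim, self.channels_mlp_dim)(x)
        x = nn.LayerNorm(name="pre_head_layer_norm")(x)
        x = jnp.mean(x, axis=1)  # global average pooling over patches
        return nn.Dense(self.num_classes, name="head",
                        kernel_init=nn.initializers.zeros)(x)
```

Under these assumptions, a Mixer-B/16-sized model (per the paper's Table 1 specifications) would be instantiated as MlpMixer(num_classes=1000, num_blocks=12, patch_size=16, hidden_dim=768, tokens_mlp_dim=384, channels_mlp_dim=3072).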
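The optimizer settings quoted in the Experiment Setup row can be approximated with optax as below. Only the warmup/decay shapes, Adam betas, and global-norm clipping come from the quoted text; the peak learning rates, step counts, and weight-decay value are placeholders (the paper's per-dataset values are in its Supplementary B).

```python
import optax

# ---- Pre-training: Adam(b1=0.9, b2=0.999), 10k-step linear warmup, linear decay,
# ---- weight decay, gradient clipping at global norm 1.
total_steps = 100_000   # placeholder; per-dataset step counts are in Supplementary B
warmup_steps = 10_000   # from the quoted setup
peak_lr = 1e-3          # placeholder peak learning rate

pretrain_schedule = optax.join_schedules(
    schedules=[
        optax.linear_schedule(0.0, peak_lr, transition_steps=warmup_steps),
        optax.linear_schedule(peak_lr, 0.0, transition_steps=total_steps - warmup_steps),
    ],
    boundaries=[warmup_steps],
)

pretrain_tx = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.adamw(pretrain_schedule, b1=0.9, b2=0.999,
                weight_decay=0.1),  # weight-decay value is a placeholder
)

# ---- Fine-tuning: momentum SGD, cosine schedule with linear warmup, no weight decay.
finetune_schedule = optax.warmup_cosine_decay_schedule(
    init_value=0.0, peak_value=3e-2,        # placeholder peak value
    warmup_steps=500, decay_steps=20_000,   # placeholder step counts
)

finetune_tx = optax.chain(
    optax.clip_by_global_norm(1.0),
    optax.sgd(finetune_schedule, momentum=0.9),
)
```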