Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026.
MLP-Mixer: An all-MLP Architecture for Vision
Authors: Ilya O. Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the performance of MLP-Mixer models, pre-trained with medium- to large-scale datasets, on a range of small and mid-sized downstream classification tasks. We are interested in three primary quantities: (1) Accuracy on the downstream task; (2) Total computational cost of pre-training... (3) Test-time throughput... Our goal is not to demonstrate state-of-the-art results, but to show that, remarkably, a simple MLP-based model is competitive with today's best convolutional and attention-based models. |
| Researcher Affiliation | Industry | Ilya Tolstikhin, Neil Houlsby, Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Thomas Unterthiner, Jessica Yung, Andreas Steiner, Daniel Keysers, Jakob Uszkoreit, Mario Lucic, Alexey Dosovitskiy; Google Research, Brain Team; EMAIL |
| Pseudocode | Yes | Overall, the architecture can be written compactly in JAX/Flax, the code is given in Supplementary F. |
| Open Source Code | Yes | MLP-Mixer code is available at https://github.com/google-research/vision_transformer |
| Open Datasets | Yes | Downstream tasks: We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23], Oxford-IIIT Pets (3.7k examples, 36 classes) [33], and Oxford Flowers-102 (2k examples, 102 classes) [32]. We also use the Visual Task Adaptation Benchmark (VTAB-1k), which consists of 19 diverse datasets, each with 1k training examples [58]. Pre-training: We follow the standard transfer learning setup: pre-training followed by fine-tuning on the downstream tasks. We pre-train our models on two public datasets: ILSVRC2012 ImageNet, and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13]. |
| Dataset Splits | Yes | We use popular downstream tasks such as ILSVRC2012 ImageNet (1.3M training examples, 1k classes) with the original validation labels [13] and cleaned-up ReaL labels [5], CIFAR-10/100 (50k examples, 10/100 classes) [23]... We pre-train our models on two public datasets: ILSVRC2012 ImageNet, and ImageNet-21k, a superset of ILSVRC2012 that contains 21k classes and 14M images [13]. |
| Hardware Specification | Yes | For the former we compute two metrics: (1) Total pre-training time on TPU-v3 accelerators... (2) Throughput in images/sec/core on TPU-v3. |
| Software Dependencies | No | The paper mentions JAX/Flax as the framework used for implementing the architecture and refers to the 'timm library' as inspiration, but does not provide specific version numbers for any software dependencies. |
| Experiment Setup | Yes | We pre-train all models at resolution 224 using Adam with β1 = 0.9, β2 = 0.999, linear learning rate warmup of 10k steps and linear decay, batch size 4096, weight decay, and gradient clipping at global norm 1. For JFT-300M, we pre-process images by applying the cropping technique from Szegedy et al. [45] in addition to random horizontal flipping. For ImageNet and ImageNet-21k, we employ additional data augmentation and regularization techniques. In particular, we use RandAugment [12], mixup [60], dropout [43], and stochastic depth [19]. More details on these hyperparameters are provided in Supplementary B. Fine-tuning: We fine-tune using momentum SGD, batch size 512, gradient clipping at global norm 1, and a cosine learning rate schedule with a linear warmup. We do not use weight decay when fine-tuning. |
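
The Experiment Setup row names two learning rate schedules (linear warmup followed by linear decay for pre-training, linear warmup followed by cosine decay for fine-tuning) and gradient clipping at global norm 1. The sketch below shows one plausible reading of those three pieces in plain Python. It is not the authors' implementation. The function names are hypothetical, the assumption that warmup starts at zero and that both schedules decay to zero is ours, and the base learning rate and total step count in the usage note are placeholders, since the quoted text gives only the 10k warmup steps.

```python
import math


def warmup_linear_decay(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then linear decay to 0 at total_steps.

    Assumption: the endpoints (start at 0, end at 0) are not stated in the
    quoted setup; they are a common default, chosen here for illustration.
    """
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


def warmup_cosine(step, base_lr, warmup_steps, total_steps):
    """Linear warmup from 0 to base_lr, then cosine decay to 0 at total_steps."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    progress = min((step - warmup_steps) / (total_steps - warmup_steps), 1.0)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))


def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale all gradients jointly so their combined L2 norm is <= max_norm.

    `grads` is a list of flat lists of floats, one per parameter tensor.
    The norm is computed over every entry of every tensor, so all tensors
    are scaled by the same factor and the gradient direction is preserved.
    """
    global_norm = math.sqrt(sum(g * g for tensor in grads for g in tensor))
    if global_norm == 0.0:
        return [list(tensor) for tensor in grads]
    scale = min(1.0, max_norm / global_norm)
    return [[g * scale for g in tensor] for tensor in grads]
```

As a usage example, `warmup_linear_decay(5_000, base_lr=1e-3, warmup_steps=10_000, total_steps=100_000)` gives half the peak rate, because step 5,000 is halfway through the 10k-step warmup; `base_lr` and `total_steps` here are illustrative values, not ones taken from the paper. Clipping by global norm, rather than per tensor, keeps the relative magnitudes between layers intact.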