Understanding MLP-Mixer as a wide and sparse MLP

Authors: Tomohiro Hayase, Ryo Karakida

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Next, for general cases, we empirically demonstrate quantitative similarities between the Mixer and the unstructured sparse-weight MLPs. In this study, we reveal that the sparseness, which is seemingly a distinct research concept, is the key mechanism underlying the MLP-Mixer. We validate the similarities between the Monarch matrix and the Mixer through experiments. We trained the normal and RP S-Mixers for various values of S and C with fixed Ω. Table 1. Test error on CIFAR-10/CIFAR-100/ImageNet-1k from scratch. (See the sparse-MLP sketch after the table for an illustration of this equivalence.)
Researcher Affiliation | Collaboration | 1Metaverse Lab, Cluster Inc.; 2Artificial Intelligence Research Center, AIST. Correspondence to: Tomohiro Hayase <t.hayase@cluster.mu>, Ryo Karakida <karakida.ryo@aist.go.jp>.
Pseudocode | No | The paper describes methods through mathematical formulations and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | No | The paper refers to the 'timm library' for ImageNet training but does not provide a link or statement about releasing the source code for the methodology described in this paper.
Open Datasets | Yes | Test error on MNIST. Test error on CIFAR-10/CIFAR-100/ImageNet-1k. We trained the models on the CIFAR-10, CIFAR-100, and STL-10 datasets.
Dataset Splits | No | The paper uses standard datasets like CIFAR-10, CIFAR-100, and ImageNet-1k, which have predefined train/test splits. However, the paper does not explicitly state the specific percentages or sample counts for these splits within its text.
Hardware Specification | Yes | For our experiments, we utilized Tesla V100 GPUs, accumulating approximately 300 GPU hours. For training MLP-Mixer and RP MLP-Mixer on ImageNet-1k, we utilized Tesla V100 GPUs and approximately 4000 GPU hours; we used a GPU cluster of 32 nodes with 4 GPUs per node for each run.
Software Dependencies | No | The paper mentions using the 'timm' library (PyTorch Image Models) but does not specify version numbers for it or for any other software dependencies used in the experiments.
Experiment Setup | Yes | Each network is trained on CIFAR-10 with a batch size of 128, for 600 epochs, a learning rate of 0.01, using auto-augmentation, the AdamW optimizer, momentum set to 0.9, and cosine annealing. We employed Nesterov SGD with a mini-batch size of 128 and a momentum of 0.9 for training, running for 200 epochs; the initial learning rate was set to 0.02, and we used cosine annealing for learning rate scheduling. For ImageNet-1k, we used AdamW with an initial learning rate of 10^-3 and 300 epochs. We set the mini-batch size to 4096 and used data-parallel training with a batch size of 32 on each GPU. We used a warm-up with a warm-up learning rate of 10^-6 over 5 warm-up epochs, and cosine annealing of the learning rate with a minimum learning rate of 10^-5. We used a weight decay of 0.05. We applied random erasing with a ratio of 0.25 and random auto-augmentation with the policy rand-m9-mstd0.5-inc1. We used mix-up with α = 0.8 and cut-mix with α = 1.0, switching between them with probability 0.5, and label smoothing with ε = 0.1. (See the training-configuration sketch after the table.)
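The equivalence cited in the Research Type row, between a Mixer and a wide MLP with sparse weights, can be illustrated with a small numerical check. The following is a minimal sketch, not the authors' code: for an S-token, C-channel input, the token-mixing and channel-mixing linear maps act on the flattened S·C-dimensional input as the Kronecker-structured matrices I_C ⊗ W_tok and W_ch ⊗ I_S, which are sparse weight matrices of an equivalent wide MLP. Nonlinearities, layer normalization, and skip connections are omitted, and the names S, C, W_tok, and W_ch are illustrative.

```python
# Minimal numerical check: a Mixer's token-mixing and channel-mixing linear maps
# correspond to block-sparse Kronecker-structured weights of a wide MLP.
import numpy as np

rng = np.random.default_rng(0)
S, C = 4, 3                           # tokens, channels (illustrative sizes)
X = rng.standard_normal((S, C))       # input as a token-by-channel matrix
W_tok = rng.standard_normal((S, S))   # token-mixing weight (shared across channels)
W_ch = rng.standard_normal((C, C))    # channel-mixing weight (shared across tokens)


def vec(M):
    """Column-major (column-stacking) vectorization."""
    return M.flatten(order="F")


# Token mixing: vec(W_tok @ X) == (I_C kron W_tok) @ vec(X), a block-diagonal matrix.
assert np.allclose(vec(W_tok @ X), np.kron(np.eye(C), W_tok) @ vec(X))

# Channel mixing: vec(X @ W_ch.T) == (W_ch kron I_S) @ vec(X), a strided sparse pattern.
assert np.allclose(vec(X @ W_ch.T), np.kron(W_ch, np.eye(S)) @ vec(X))

# Fraction of nonzero entries in each equivalent wide weight matrix.
print("token-mixing density:", np.count_nonzero(np.kron(np.eye(C), W_tok)) / (S * C) ** 2)
print("channel-mixing density:", np.count_nonzero(np.kron(W_ch, np.eye(S))) / (S * C) ** 2)
```

Under these conventions the densities come out to 1/C and 1/S respectively, which is the sense in which the equivalent MLP acting on the S·C-dimensional flattened input is both wide and sparse.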
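The ImageNet-1k recipe quoted in the Experiment Setup row maps onto standard timm and PyTorch components. The sketch below is an illustrative reconstruction under assumptions, not the authors' released code: the model choice and the training loop are placeholders (the paper's RP-Mixer variants would need a custom implementation), and only the stated hyperparameters are wired up: AdamW at 10^-3 with weight decay 0.05, cosine annealing to 10^-5 with a 5-epoch warm-up from 10^-6, 300 epochs, RandAugment policy rand-m9-mstd0.5-inc1, random erasing 0.25, mix-up/cut-mix switching, and label smoothing 0.1. The stated global batch of 4096 is consistent with 32 nodes × 4 GPUs × 32 samples per GPU.

```python
# Illustrative sketch of the stated ImageNet-1k recipe using timm + PyTorch.
# Model creation, data loading, and the training loop are placeholders; only the
# hyperparameters quoted in the table are configured here.
import torch
import timm
from timm.data import Mixup, create_transform
from timm.loss import SoftTargetCrossEntropy
from timm.scheduler import CosineLRScheduler

NUM_CLASSES = 1000
EPOCHS = 300

# Augmentation: RandAugment policy rand-m9-mstd0.5-inc1, random erasing prob 0.25.
train_transform = create_transform(
    input_size=224,
    is_training=True,
    auto_augment="rand-m9-mstd0.5-inc1",
    re_prob=0.25,
)

# Mix-up (alpha=0.8) and cut-mix (alpha=1.0), switched with probability 0.5,
# with label smoothing eps=0.1 folded into the soft targets.
mixup_fn = Mixup(
    mixup_alpha=0.8,
    cutmix_alpha=1.0,
    switch_prob=0.5,
    label_smoothing=0.1,
    num_classes=NUM_CLASSES,
)

# Placeholder model: a stock Mixer from the timm registry stands in for the
# paper's Mixer / RP-Mixer variants.
model = timm.create_model("mixer_b16_224", num_classes=NUM_CLASSES)

# AdamW with initial learning rate 1e-3 and weight decay 0.05.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# Cosine annealing to a minimum of 1e-5, with a 5-epoch warm-up from 1e-6.
scheduler = CosineLRScheduler(
    optimizer,
    t_initial=EPOCHS,
    lr_min=1e-5,
    warmup_t=5,
    warmup_lr_init=1e-6,
)

criterion = SoftTargetCrossEntropy()  # soft targets are produced by mixup_fn

for epoch in range(EPOCHS):
    scheduler.step(epoch)
    # A train_one_epoch(...) call would go here, using train_transform, mixup_fn,
    # and criterion, with a per-GPU batch size of 32 and data-parallel training
    # across 32 nodes x 4 GPUs for a global batch of 4096.
```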