Understanding MLP-Mixer as a wide and sparse MLP
Authors: Tomohiro Hayase, Ryo Karakida
ICML 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Next, for general cases, we empirically demonstrate quantitative similarities between the Mixer and the unstructured sparse-weight MLPs. In this study, we reveal that the sparseness, which is seemingly a distinct research concept, is the key mechanism underlying the MLP-Mixer. We validate the similarities between the Monarch matrix and the Mixer through experiments. We trained the normal and RP S-Mixers for various values of S and C with fixed Ω. Table 1. Test error on CIFAR-10/CIFAR-100/ImageNet-1k from scratch. (A minimal sketch of this Mixer-as-sparse-MLP correspondence appears after the table.) |
| Researcher Affiliation | Collaboration | 1Metaverse Lab, Cluster Inc. 2Artificial Intelligence Research Center, AIST. Correspondence to: Tomohiro Hayase <t.hayase@cluster.mu>, Ryo Karakida <karakida.ryo@aist.go.jp>. |
| Pseudocode | No | The paper describes methods through mathematical formulations and textual explanations but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper refers to the 'timm library' for ImageNet training but does not provide a link or statement about releasing the source code for the methodology described in this paper. |
| Open Datasets | Yes | Test error on MNIST. Test error on CIFAR-10/CIFAR-100/ImageNet-1k. We trained the models on the CIFAR-10, CIFAR-100, and STL-10 datasets. |
| Dataset Splits | No | The paper uses standard datasets like CIFAR-10, CIFAR-100, and ImageNet-1k, which typically have predefined train/test/validation splits. However, the paper does not explicitly state the specific percentages or sample counts for these splits within its text. |
| Hardware Specification | Yes | For our experiments, we utilized Tesla V100 GPUs, accumulating approximately 300 GPU hours. We utilized Tesla V100 GPUs and approximately 4000 GPU hours for training MLP-Mixer and RP MLP-Mixer on ImageNet-1k; we used a GPU cluster of 32 nodes with 4 GPUs per node for each run. |
| Software Dependencies | No | The paper mentions using the timm library (PyTorch Image Models) but does not specify version numbers for it or any other software dependencies used in the experiments. |
| Experiment Setup | Yes | Each network is trained on CIFAR-10 with a batch size of 128 for 600 epochs, with a learning rate of 0.01, using auto-augmentation, the AdamW optimizer, momentum set to 0.9, and cosine annealing. We employed Nesterov SGD with a mini-batch size of 128 and a momentum of 0.9 for training, running for 200 epochs. The initial learning rate was set to 0.02, and we used cosine annealing for learning rate scheduling. We used AdamW with an initial learning rate of 10^-3 and 300 epochs. We set the mini-batch size to 4096 and used data-parallel training with a batch size of 32 on each GPU. We used a warm-up with a warm-up learning rate of 10^-6 over 5 warm-up epochs. We used cosine annealing of the learning rate with a minimum learning rate of 10^-5. We used a weight decay of 0.05. We applied random erasing to images with a ratio of 0.25. We also applied random auto-augmentation with the policy rand-m9-mstd0.5-inc1. We used mix-up with α = 0.8 and cut-mix with α = 1.0, switching between them with probability 0.5. We used label smoothing with ε = 0.1. (A hedged code sketch of this ImageNet-1k training setup appears after the table.) |
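
The Research Type row quotes the paper's claim that the Mixer can be read as a wide, structured-sparse MLP. As a minimal illustration of that correspondence (a generic linear-algebra sketch, not the paper's exact construction; the sizes `S`, `C` and all variable names are chosen here purely for demonstration), channel mixing applied per token equals a block-diagonal Kronecker weight acting on the flattened input, and token mixing equals a Kronecker product with the identity:

```python
import torch

# Toy sizes: S tokens, C channels (hypothetical values, purely for illustration).
S, C = 4, 3
X = torch.randn(S, C)       # token-by-channel input of one Mixer layer
W_c = torch.randn(C, C)     # channel-mixing weight (shared across tokens)
W_t = torch.randn(S, S)     # token-mixing weight (shared across channels)

x_flat = X.reshape(-1)      # row-major flatten: [token 0 channels, token 1 channels, ...]

# Channel mixing per token  <=>  block-diagonal matrix  I_S (x) W_c  acting on x_flat.
channel_mixed = X @ W_c.T                          # usual per-token linear layer
wide_channel = torch.kron(torch.eye(S), W_c)       # (S*C) x (S*C), block-diagonal => sparse
assert torch.allclose(channel_mixed.reshape(-1), wide_channel @ x_flat, atol=1e-5)

# Token mixing per channel  <=>  Kronecker-structured matrix  W_t (x) I_C.
token_mixed = W_t @ X                              # usual token-mixing linear layer
wide_token = torch.kron(W_t, torch.eye(C))         # structured-sparse weight of a wide MLP
assert torch.allclose(token_mixed.reshape(-1), wide_token @ x_flat, atol=1e-5)

print("Both mixing steps match their wide, structured-sparse MLP counterparts.")
```

Each Kronecker factor is mostly zeros, which is what makes the equivalent wide MLP sparse rather than dense.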
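The Experiment Setup row describes the ImageNet-1k recipe only in prose. The sketch below reconstructs it with PyTorch and timm under stated assumptions: the model variant (`mixer_b16_224`), the data-loading code, single-device training (the paper used 32 nodes of 4 GPUs with an effective batch size of 4096), and the per-epoch scheduler stepping are guesses; only the quoted hyperparameters (AdamW, lr 10^-3, weight decay 0.05, 300 epochs, 5 warm-up epochs from 10^-6, cosine annealing to 10^-5, rand-m9-mstd0.5-inc1 auto-augmentation, random erasing 0.25, mix-up 0.8 / cut-mix 1.0 with switch probability 0.5, label smoothing 0.1) come from the paper.

```python
import torch
import timm
from timm.data import Mixup, create_transform
from timm.loss import SoftTargetCrossEntropy

NUM_CLASSES, EPOCHS, WARMUP_EPOCHS = 1000, 300, 5
device = "cuda" if torch.cuda.is_available() else "cpu"

# Model variant is an assumption; the paper excerpt only says "MLP-Mixer".
model = timm.create_model("mixer_b16_224", num_classes=NUM_CLASSES).to(device)

# AdamW with initial lr 1e-3 and weight decay 0.05.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3, weight_decay=0.05)

# 5-epoch linear warm-up from 1e-6, then cosine annealing down to 1e-5.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1e-6 / 1e-3, total_iters=WARMUP_EPOCHS)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=EPOCHS - WARMUP_EPOCHS, eta_min=1e-5)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[WARMUP_EPOCHS])

# RandAugment policy rand-m9-mstd0.5-inc1 and random erasing with probability 0.25.
train_transform = create_transform(
    input_size=224, is_training=True,
    auto_augment="rand-m9-mstd0.5-inc1", re_prob=0.25)

# Mix-up (alpha=0.8) / cut-mix (alpha=1.0) switched with probability 0.5,
# producing soft targets that already include label smoothing eps=0.1.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, switch_prob=0.5,
                 label_smoothing=0.1, num_classes=NUM_CLASSES)
criterion = SoftTargetCrossEntropy()

def train_one_epoch(loader):
    """One pass over a DataLoader yielding (images, labels) batches."""
    model.train()
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        images, soft_targets = mixup_fn(images, labels)
        loss = criterion(model(images), soft_targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()  # stepping once per epoch is a simplification
```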