Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Authors: Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, Wenjun Zeng

AAAI 2022, pp. 2344-2351 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our model based on ImageNet-1K dataset (Krizhevsky, Sutskever, and Hinton 2012) which contains 1.2 million training images from one thousand categories and 50 thousand validation images with 50 images in each category. We train our model using AdamW (Loshchilov and Hutter 2018) with weight decay 0.05 and a batch size of 1024. We carry out ablation studies on all three variants of the sMLPNet.
Researcher Affiliation | Collaboration | (1) Microsoft Research Asia, Beijing, China; (2) University of Science and Technology of China, Hefei, China; {chutan, cluo, wenxie, wezeng}@microsoft.com, {lnc, flylight}@mail.ustc.edu.cn
Pseudocode | Yes | Algorithm 1: Pseudocode of sMLP (PyTorch-like). Input: x # input tensor of shape (H, W, C); Output: x # output tensor of shape (H, W, C). (A hedged sketch of such a block is given after this table.)
Open Source Code | Yes | The code and models are publicly available at https://github.com/microsoft/SPACH.
Open Datasets | Yes | We evaluate our model based on ImageNet-1K dataset (Krizhevsky, Sutskever, and Hinton 2012) which contains 1.2 million training images from one thousand categories and 50 thousand validation images with 50 images in each category.
Dataset Splits | Yes | We evaluate our model based on ImageNet-1K dataset (Krizhevsky, Sutskever, and Hinton 2012) which contains 1.2 million training images from one thousand categories and 50 thousand validation images with 50 images in each category. (A minimal loading sketch for this split appears after the table.)
Hardware Specification | Yes | All training is conducted with 8 NVIDIA Tesla V100 GPU cards.
Software Dependencies | No | The paper mentions "PyTorch-like pseudocode" but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We train our model using AdamW (Loshchilov and Hutter 2018) with weight decay 0.05 and a batch size of 1024. We use a linear warm-up and cosine decay. The initial learning rate is 1e-3 and gradually drops to 1e-5 in 300 epochs. We also use label smoothing (Szegedy et al. 2016) and DropPath (Larsson, Maire, and Shakhnarovich 2016). DropPath rates for our tiny, small, and base models are 0, 0.2, and 0.3, respectively. For data augmentation methods, we use RandAug (Cubuk et al. 2020), repeated augmentation (Hoffer et al. 2020), MixUp (Zhang et al. 2018), and CutMix (Zhong et al. 2020). (A hedged configuration sketch of this recipe follows the table.)
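
The pseudocode row above only records the input and output shapes of the sMLP block. As a point of reference, here is a minimal PyTorch sketch of a sparse-MLP token-mixing block in that spirit: tokens are mixed along the width and height axes with weight-shared linear layers, and the identity, height-mixed, and width-mixed branches are fused by a pointwise convolution. The class name, the batch-first (B, C, H, W) layout, and the fusion details are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SparseMLPBlock(nn.Module):
    """Hypothetical sketch of a sparse-MLP token-mixing block.

    Tokens are mixed along W and along H with weight-shared linear layers;
    the identity, height-mixed, and width-mixed branches are concatenated
    and fused with a 1x1 convolution. Details may differ from the official
    sMLP code at https://github.com/microsoft/SPACH.
    """

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # shared across all rows and channels
        self.mix_h = nn.Linear(height, height)  # shared across all columns and channels
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); the paper's Algorithm 1 uses (H, W, C) without a batch axis.
        x_w = self.mix_w(x)                                  # mix tokens along the W axis
        x_h = self.mix_h(x.transpose(2, 3)).transpose(2, 3)  # mix tokens along the H axis
        return self.fuse(torch.cat([x, x_h, x_w], dim=1))    # fuse branches -> (B, C, H, W)


if __name__ == "__main__":
    block = SparseMLPBlock(channels=96, height=56, width=56)
    print(block(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```

Note that each mixing layer is only H x H or W x W, so the token-mixing cost grows with the side length rather than with the number of tokens, which is the sparsity argument the paper makes against dense token mixing and self-attention.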
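
The dataset rows refer to the standard ImageNet-1K split: about 1.28 million training images and 50 thousand validation images over 1,000 classes. A minimal torchvision loading sketch under the usual directory layout is shown below; the data root and the transform choices are placeholders, not taken from the paper.

```python
from torchvision import datasets, transforms

# Assumed standard ImageNet-1K layout: <root>/train/<class>/*.JPEG and <root>/val/<class>/*.JPEG
DATA_ROOT = "/path/to/imagenet"  # placeholder path

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder(f"{DATA_ROOT}/train", transform=train_tf)  # ~1.28M images
val_set = datasets.ImageFolder(f"{DATA_ROOT}/val", transform=val_tf)        # 50K images, 50 per class
```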
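
The experiment-setup row lists the optimizer and schedule hyperparameters but no code. The sketch below wires them together in PyTorch under stated assumptions: the warm-up length, the per-epoch (rather than per-step) scheduling, and the label-smoothing value of 0.1 are not given in the excerpt and are assumptions; RandAugment, repeated augmentation, MixUp, and CutMix are typically taken from the timm library and are omitted here.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS = 300
WARMUP_EPOCHS = 20            # assumed; the excerpt only says "linear warm-up"
BASE_LR, MIN_LR = 1e-3, 1e-5  # initial LR decays to 1e-5 over 300 epochs
WEIGHT_DECAY = 0.05
BATCH_SIZE = 1024             # global batch size (8 V100 GPUs in the paper)


def build_optimizer_and_scheduler(model: torch.nn.Module):
    optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

    def lr_factor(epoch: int) -> float:
        # Linear warm-up, then cosine decay from BASE_LR down to MIN_LR.
        if epoch < WARMUP_EPOCHS:
            return (epoch + 1) / WARMUP_EPOCHS
        progress = (epoch - WARMUP_EPOCHS) / max(1, EPOCHS - WARMUP_EPOCHS)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (MIN_LR + (BASE_LR - MIN_LR) * cosine) / BASE_LR

    return optimizer, LambdaLR(optimizer, lr_lambda=lr_factor)


# Label smoothing (value assumed; the excerpt only cites Szegedy et al. 2016)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```

DropPath rates (0 / 0.2 / 0.3 for tiny / small / base) are model-level stochastic-depth settings and would be passed to the network constructor rather than to the optimizer, so they are not shown here.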