Sparse MLP for Image Recognition: Is Self-Attention Really Necessary?

Authors: Chuanxin Tang, Yucheng Zhao, Guangting Wang, Chong Luo, Wenxuan Xie, Wenjun Zeng

AAAI 2022, pp. 2344-2351 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our model based on ImageNet-1K dataset (Krizhevsky, Sutskever, and Hinton 2012) which contains 1.2 million training images from one thousand categories and 50 thousand validation images with 50 images in each category. We train our model using AdamW (Loshchilov and Hutter 2018) with weight decay 0.05 and a batch size of 1024. We carry out ablation studies on all three variants of the sMLPNet.
Researcher Affiliation | Collaboration | (1) Microsoft Research Asia, Beijing, China; (2) University of Science and Technology of China, Hefei, China; {chutan, cluo, wenxie, wezeng}@microsoft.com, {lnc, flylight}@mail.ustc.edu.cn
Pseudocode | Yes | Algorithm 1: Pseudocode of sMLP (PyTorch-like). Input: x # input tensor of shape (H, W, C); Output: x # output tensor of shape (H, W, C). (A hedged sketch of such a block is given after this table.)
Open Source Code | Yes | The code and models are publicly available at https://github.com/microsoft/SPACH.
Open Datasets | Yes | We evaluate our model based on ImageNet-1K dataset (Krizhevsky, Sutskever, and Hinton 2012) which contains 1.2 million training images from one thousand categories and 50 thousand validation images with 50 images in each category.
Dataset Splits | Yes | We evaluate our model based on ImageNet-1K dataset (Krizhevsky, Sutskever, and Hinton 2012) which contains 1.2 million training images from one thousand categories and 50 thousand validation images with 50 images in each category. (A minimal loading sketch for this split appears after the table.)
Hardware Specification | Yes | All training is conducted with 8 NVIDIA Tesla V100 GPU cards.
Software Dependencies | No | The paper mentions "PyTorch-like pseudocode" but does not specify version numbers for PyTorch or any other software dependencies.
Experiment Setup | Yes | We train our model using AdamW (Loshchilov and Hutter 2018) with weight decay 0.05 and a batch size of 1024. We use a linear warm-up and cosine decay. The initial learning rate is 1e-3 and gradually drops to 1e-5 in 300 epochs. We also use label smoothing (Szegedy et al. 2016) and DropPath (Larsson, Maire, and Shakhnarovich 2016). DropPath rates for our tiny, small, and base models are 0, 0.2, and 0.3, respectively. For data augmentation methods, we use RandAug (Cubuk et al. 2020), repeated augmentation (Hoffer et al. 2020), MixUp (Zhang et al. 2018), and CutMix (Zhong et al. 2020). (A hedged configuration sketch of this recipe follows the table.)
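
The pseudocode row above only records the input and output shapes of the sMLP block. As a point of reference, here is a minimal PyTorch sketch of a sparse-MLP token-mixing block in that spirit: tokens are mixed along the width and height axes with weight-shared linear layers, and the identity, height-mixed, and width-mixed branches are fused by a pointwise convolution. The class name, the batch-first (B, C, H, W) layout, and the fusion details are assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class SparseMLPBlock(nn.Module):
    """Hypothetical sketch of a sparse-MLP token-mixing block.

    Tokens are mixed along W and along H with weight-shared linear layers;
    the identity, height-mixed, and width-mixed branches are concatenated
    and fused with a 1x1 convolution. Details may differ from the official
    sMLP code at https://github.com/microsoft/SPACH.
    """

    def __init__(self, channels: int, height: int, width: int):
        super().__init__()
        self.mix_w = nn.Linear(width, width)    # shared across all rows and channels
        self.mix_h = nn.Linear(height, height)  # shared across all columns and channels
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W); the paper's Algorithm 1 uses (H, W, C) without a batch axis.
        x_w = self.mix_w(x)                                  # mix tokens along the W axis
        x_h = self.mix_h(x.transpose(2, 3)).transpose(2, 3)  # mix tokens along the H axis
        return self.fuse(torch.cat([x, x_h, x_w], dim=1))    # fuse branches -> (B, C, H, W)


if __name__ == "__main__":
    block = SparseMLPBlock(channels=96, height=56, width=56)
    print(block(torch.randn(2, 96, 56, 56)).shape)  # torch.Size([2, 96, 56, 56])
```

Note that each mixing layer is only H x H or W x W, so the token-mixing cost grows with the side length rather than with the number of tokens, which is the sparsity argument the paper makes against dense token mixing and self-attention.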
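
The dataset rows refer to the standard ImageNet-1K split: about 1.28 million training images and 50 thousand validation images over 1,000 classes. A minimal torchvision loading sketch under the usual directory layout is shown below; the data root and the transform choices are placeholders, not taken from the paper.

```python
from torchvision import datasets, transforms

# Assumed standard ImageNet-1K layout: <root>/train/<class>/*.JPEG and <root>/val/<class>/*.JPEG
DATA_ROOT = "/path/to/imagenet"  # placeholder path

train_tf = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])
val_tf = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
])

train_set = datasets.ImageFolder(f"{DATA_ROOT}/train", transform=train_tf)  # ~1.28M images
val_set = datasets.ImageFolder(f"{DATA_ROOT}/val", transform=val_tf)        # 50K images, 50 per class
```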
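
The experiment-setup row lists the optimizer and schedule hyperparameters but no code. The sketch below wires them together in PyTorch under stated assumptions: the warm-up length, the per-epoch (rather than per-step) scheduling, and the label-smoothing value of 0.1 are not given in the excerpt and are assumptions; RandAugment, repeated augmentation, MixUp, and CutMix are typically taken from the timm library and are omitted here.

```python
import math

import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

EPOCHS = 300
WARMUP_EPOCHS = 20            # assumed; the excerpt only says "linear warm-up"
BASE_LR, MIN_LR = 1e-3, 1e-5  # initial LR decays to 1e-5 over 300 epochs
WEIGHT_DECAY = 0.05
BATCH_SIZE = 1024             # global batch size (8 V100 GPUs in the paper)


def build_optimizer_and_scheduler(model: torch.nn.Module):
    optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)

    def lr_factor(epoch: int) -> float:
        # Linear warm-up, then cosine decay from BASE_LR down to MIN_LR.
        if epoch < WARMUP_EPOCHS:
            return (epoch + 1) / WARMUP_EPOCHS
        progress = (epoch - WARMUP_EPOCHS) / max(1, EPOCHS - WARMUP_EPOCHS)
        cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
        return (MIN_LR + (BASE_LR - MIN_LR) * cosine) / BASE_LR

    return optimizer, LambdaLR(optimizer, lr_lambda=lr_factor)


# Label smoothing (value assumed; the excerpt only cites Szegedy et al. 2016)
criterion = torch.nn.CrossEntropyLoss(label_smoothing=0.1)
```

DropPath rates (0 / 0.2 / 0.3 for tiny / small / base) are model-level stochastic-depth settings and would be passed to the network constructor rather than to the optimizer, so they are not shown here.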