AS-MLP: An Axial Shifted MLP Architecture for Vision

Authors: Dongze Lian, Zehao Yu, Xing Sun, Shenghua Gao

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To evaluate the effectiveness of our AS-MLP, we conduct experiments of image classification on the ImageNet-1K benchmark... All image classification results are shown in Table 1. We divide all network architectures into CNN-based, Transformer-based and MLP-based architectures... The experimental results show that our model significantly exceeds Swin Transformer (Liu et al., 2021b) in the mobile setting (76.05% vs. 75.11%). We also compare different connection types of the AS-MLP block, such as serial connection and parallel connection, and the results are shown in Table 3b.
Researcher Affiliation | Collaboration | Dongze Lian, Zehao Yu (ShanghaiTech University, {liandz,yuzh}@shanghaitech.edu.cn); Xing Sun (Youtu Lab, Tencent, winfredsun@tencent.com); Shenghua Gao (ShanghaiTech University & Shanghai Engineering Research Center of Intelligent Vision and Imaging & Shanghai Engineering Research Center of Energy Efficient and Custom AI IC, gaoshh@shanghaitech.edu.cn)
Pseudocode | Yes | Algorithm 1: Code of AS-MLP Block in a PyTorch-like style.
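The paper's Algorithm 1 expresses the axial shift with `torch.nn.functional.pad`, `torch.chunk`, and `torch.narrow`. As a rough, framework-free illustration of the same idea, the sketch below performs a zero-padded axial shift in NumPy: channels are split into `shift_size` groups and each group is displaced by a different offset along one spatial axis. The function name and NumPy formulation are mine, not the paper's.

```python
import numpy as np

def axial_shift(x, shift_size=3, axis=2):
    """Zero-padded axial shift (illustrative sketch, not the paper's code).

    x: array of shape (C, H, W). Channels are split into `shift_size`
    groups; group g is shifted by (g - shift_size // 2) positions along
    `axis` (1 = H, 2 = W), with vacated positions filled by zeros,
    mirroring the pad + narrow pattern in the paper's Algorithm 1.
    """
    pad = shift_size // 2
    groups = np.array_split(x, shift_size, axis=0)
    out = []
    for g, chunk in enumerate(groups):
        offset = g - pad
        shifted = np.zeros_like(chunk)
        if offset == 0:
            shifted = chunk.copy()
        elif axis == 1:
            if offset > 0:
                shifted[:, offset:, :] = chunk[:, :-offset, :]
            else:
                shifted[:, :offset, :] = chunk[:, -offset:, :]
        else:
            if offset > 0:
                shifted[:, :, offset:] = chunk[:, :, :-offset]
            else:
                shifted[:, :, :offset] = chunk[:, :, -offset:]
        out.append(shifted)
    return np.concatenate(out, axis=0)
```

In the full AS-MLP block, one such shift along H and one along W are each followed by channel-mixing MLPs, so every output location mixes information from an axial neighborhood.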
Open Source Code | Yes | Code is available at https://github.com/svip-lab/AS-MLP.
Open Datasets | Yes | To evaluate the effectiveness of our AS-MLP, we conduct experiments of image classification on the ImageNet-1K benchmark, which is collected in (Deng et al., 2009). It contains 1.28M training images and 50K validation images from a total of 1000 classes. For object detection and instance segmentation, we employ mmdetection (Chen et al., 2019) as the framework and COCO (Lin et al., 2014) as the evaluation dataset, which consists of 118K training images and 5K validation images. Following Swin Transformer (Liu et al., 2021b), we conduct experiments of AS-MLP on the challenging semantic segmentation dataset ADE20K, which contains 20,210 training images and 2,000 validation images.
Dataset Splits | Yes | ImageNet-1K contains 1.28M training images and 50K validation images from a total of 1000 classes. COCO (Lin et al., 2014) serves as the evaluation dataset, consisting of 118K training images and 5K validation images. ADE20K contains 20,210 training images and 2,000 validation images.
Hardware Specification | Yes | Throughput is measured with a batch size of 64 on a single V100 GPU (32GB).
Software Dependencies | No | The paper's Algorithm 1 includes `import torch` and `import torch.nn.functional as F`, implying the use of PyTorch. However, no version numbers are provided for PyTorch or any other software components.
Experiment Setup | Yes | We use an initial learning rate of 0.001 with cosine decay and 20 epochs of linear warm-up. The AdamW (Loshchilov & Hutter, 2019) optimizer is employed to train the whole model for 300 epochs with a batch size of 1024. Following the training strategy of Swin Transformer (Liu et al., 2021b), we also use label smoothing (Szegedy et al., 2016) with a smoothing ratio of 0.1 and the DropPath (Huang et al., 2016) strategy. For object detection: AdamW optimizer, learning rate of 0.0001, weight decay of 0.05, and a batch size of 2 images per GPU on 8 GPUs. For semantic segmentation: AdamW optimizer, learning rate of 6e-5, weight decay of 0.01, and a batch size of 2 images per GPU on 8 GPUs. The input image resolution is 512×512, the stochastic depth ratio is set to 0.3, and all models are initialized with weights pre-trained on ImageNet-1K and trained for 160K iterations.
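The stated ImageNet-1K schedule (base learning rate 1e-3, 20 epochs of linear warm-up, cosine decay over 300 total epochs) can be sketched as a per-epoch rule. A zero minimum learning rate is an assumption here; the paper does not state a floor.

```python
import math

def lr_at_epoch(epoch, base_lr=1e-3, warmup_epochs=20,
                total_epochs=300, min_lr=0.0):
    """Learning rate at a given epoch for the paper's ImageNet-1K setup.

    Linear warm-up for the first `warmup_epochs`, then cosine decay to
    `min_lr` over the remaining epochs. min_lr=0 is an assumption, not
    a value stated in the paper.
    """
    if epoch < warmup_epochs:
        # linear warm-up: ramps from base_lr / warmup_epochs to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    # cosine decay: progress goes from 0 (end of warm-up) to ~1 (last epoch)
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))
```

In practice the same shape is available off the shelf, e.g. via PyTorch's `torch.optim.lr_scheduler.CosineAnnealingLR` combined with a warm-up scheduler.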