SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
Authors: Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on ImageNet-1K classification show that SparseFormer delivers performance on par with canonical or well-established models while offering a more favorable accuracy-throughput tradeoff. Moreover, the design of our network can be easily extended to the video classification task with promising performance at lower compute. We benchmark our presented SparseFormers on ImageNet-1K classification (Deng et al., 2009) and Kinetics-400 (Carreira & Zisserman, 2017) video classification. We also report our preliminary trials on downstream tasks, semantic segmentation and object detection, in the appendix. |
| Researcher Affiliation | Collaboration | Ziteng Gao (Show Lab, National University of Singapore); Zhan Tong (Ant Group); Limin Wang (Nanjing University); Mike Zheng Shou (Show Lab, National University of Singapore) |
| Pseudocode | No | The paper describes the architecture and processes in text and diagrams (Figure 2), but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and weights are available at https://github.com/showlab/sparseformer. |
| Open Datasets | Yes | We benchmark our presented SparseFormers on ImageNet-1K classification (Deng et al., 2009) and Kinetics-400 (Carreira & Zisserman, 2017) video classification. For ImageNet-21K pre-training (Deng et al., 2009), we use the subset, winter 2021 release, as suggested by (Ridnik et al., 2021). |
| Dataset Splits | Yes | For ImageNet-1K classification (Deng et al., 2009), we train the proposed SparseFormer according to the recipe in (Liu et al., 2021a), which includes a training budget of 300 epochs, the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, a weight decay of 0.05, and various augmentation and regularization strategies. The input resolution is fixed to 224x224. More visualizations on RoIs and sampling points: in order to confirm the general ability of SparseFormer to focus on foregrounds, we present additional visualizations in Figures 4 and 5 with ImageNet-1K (Deng et al., 2009) validation set inputs. |
| Hardware Specification | Yes | The throughput is measured with FP32 on a single V100 GPU following (Liu et al., 2021a). We benchmark more throughput comparisons here on a more recent A5000 GPU in Table 9 with both FP32 and FP16. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify versions for any programming languages, libraries, or frameworks (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For ImageNet-1K classification (Deng et al., 2009), we train the proposed SparseFormer according to the recipe in (Liu et al., 2021a), which includes a training budget of 300 epochs, the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, a weight decay of 0.05, and various augmentation and regularization strategies. The input resolution is fixed to 224x224. We add EMA (Polyak & Juditsky, 1992) to stabilize the training. The stochastic depth (i.e., drop path) (Huang et al., 2016) rate is set to 0.2, 0.3, and 0.4 for SparseFormer-T, -S, and -B. For training on Kinetics-400 (Carreira & Zisserman, 2017), we use ImageNet pre-trained weights to initialize SparseFormers... The number of input frames is set to T = 32... We train the model for 50 epochs with 5 linear warm-up epochs. The mini-batch size is 8 videos per GPU. The learning rate is set to 5e-4, and we adopt a cosine learning rate schedule (Loshchilov & Hutter, 2016). |
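
To make the quoted ImageNet-1K recipe concrete, below is a minimal sketch of the optimizer, cosine schedule, and EMA setup, assuming a PyTorch implementation (the Software Dependencies row notes the paper names no framework). The stand-in model, the dummy data loader, and the EMA decay of 0.9999 are placeholders for illustration, not details taken from the paper; the real model and weights are at https://github.com/showlab/sparseformer.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.optim.swa_utils import AveragedModel
from torch.utils.data import DataLoader, TensorDataset

EPOCHS = 300          # training budget from the quoted recipe (Liu et al., 2021a)
BASE_LR = 1e-3        # initial learning rate
WEIGHT_DECAY = 0.05   # weight decay
EMA_DECAY = 0.9999    # assumption: the paper says EMA is used but not its decay value

# Stand-in for the SparseFormer backbone; a tiny linear classifier keeps the sketch
# self-contained. The drop-path rates of 0.2/0.3/0.4 for -T/-S/-B would be set
# inside the real model and are not reproduced here.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000))

# Dummy 224x224 inputs standing in for the ImageNet-1K loader with augmentations.
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))),
    batch_size=4,
)

optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)

# EMA of the weights via PyTorch's averaged-model utility.
ema_model = AveragedModel(
    model, avg_fn=lambda ema_p, p, n: EMA_DECAY * ema_p + (1.0 - EMA_DECAY) * p
)

for epoch in range(EPOCHS):
    for images, targets in train_loader:
        loss = torch.nn.functional.cross_entropy(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema_model.update_parameters(model)
    scheduler.step()  # cosine decay of the learning rate over the training budget
```

The Kinetics-400 fine-tuning quoted above follows the same pattern with ImageNet-initialized weights, 50 epochs, 5 linear warm-up epochs, a base learning rate of 5e-4, and 8 videos per GPU; only the hyperparameter values change in the sketch.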