SparseFormer: Sparse Visual Recognition via Limited Latent Tokens
Authors: Ziteng Gao, Zhan Tong, Limin Wang, Mike Zheng Shou
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on ImageNet-1K classification show that SparseFormer delivers performance on par with canonical or well-established models while offering a more favorable accuracy-throughput tradeoff. Moreover, the design of our network can be easily extended to the video classification task with promising performance at lower compute. We benchmark our presented SparseFormers on ImageNet-1K classification (Deng et al., 2009) and Kinetics-400 (Carreira & Zisserman, 2017) video classification. We also report our preliminary trials on downstream tasks, semantic segmentation and object detection, in the appendix. |
| Researcher Affiliation | Collaboration | Ziteng Gao (Show Lab, National University of Singapore); Zhan Tong (Ant Group); Limin Wang (Nanjing University); Mike Zheng Shou (Show Lab, National University of Singapore) |
| Pseudocode | No | The paper describes the architecture and processes in text and diagrams (Figure 2), but does not include any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code and weights are available at https://github.com/showlab/sparseformer. |
| Open Datasets | Yes | We benchmark our presented SparseFormers on ImageNet-1K classification (Deng et al., 2009) and Kinetics-400 (Carreira & Zisserman, 2017) video classification. For ImageNet-21K pre-training (Deng et al., 2009), we use the subset, winter 2021 release, as suggested by (Ridnik et al., 2021). |
| Dataset Splits | Yes | For ImageNet-1K classification (Deng et al., 2009), we train the proposed SparseFormer according to the recipe in (Liu et al., 2021a), which includes a training budget of 300 epochs, the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, a weight decay of 0.05, and various augmentation and regularization strategies. The input resolution is fixed to 224x224. More visualizations on RoIs and sampling points: in order to confirm the general ability of SparseFormer to focus on foregrounds, we present additional visualizations in Figures 4 and 5 with ImageNet-1K (Deng et al., 2009) validation set inputs. |
| Hardware Specification | Yes | The throughput is measured with FP32 on a single V100 GPU following (Liu et al., 2021a). We benchmark more throughput comparisons here on a more recent A5000 GPU in Table 9 with both FP32 and FP16. |
| Software Dependencies | No | The paper mentions using the 'AdamW optimizer' but does not specify versions for any programming languages, libraries, or frameworks (e.g., PyTorch, TensorFlow, CUDA versions). |
| Experiment Setup | Yes | For ImageNet-1K classification (Deng et al., 2009), we train the proposed SparseFormer according to the recipe in (Liu et al., 2021a), which includes a training budget of 300 epochs, the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 0.001, a weight decay of 0.05, and various augmentation and regularization strategies. The input resolution is fixed to 224x224. We add EMA (Polyak & Juditsky, 1992) to stabilize the training. The stochastic depth (i.e., drop path) (Huang et al., 2016) rate is set to 0.2, 0.3, and 0.4 for SparseFormer-T, -S, and -B. For training on Kinetics-400 (Carreira & Zisserman, 2017), we use ImageNet pre-trained weights to initialize SparseFormers... The number of input frames is set to T = 32... We train the model for 50 epochs with 5 linear warm-up epochs. The mini-batch size is 8 videos per GPU. The learning rate is set to 5e-4, and we adopt a cosine learning rate schedule (Loshchilov & Hutter, 2016). |
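
To make the quoted ImageNet-1K recipe concrete, below is a minimal sketch of the optimizer, cosine schedule, and EMA setup, assuming a PyTorch implementation (the Software Dependencies row notes the paper names no framework). The stand-in model, the dummy data loader, and the EMA decay of 0.9999 are placeholders for illustration, not details taken from the paper; the real model and weights are at https://github.com/showlab/sparseformer.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR
from torch.optim.swa_utils import AveragedModel
from torch.utils.data import DataLoader, TensorDataset

EPOCHS = 300          # training budget from the quoted recipe (Liu et al., 2021a)
BASE_LR = 1e-3        # initial learning rate
WEIGHT_DECAY = 0.05   # weight decay
EMA_DECAY = 0.9999    # assumption: the paper says EMA is used but not its decay value

# Stand-in for the SparseFormer backbone; a tiny linear classifier keeps the sketch
# self-contained. The drop-path rates of 0.2/0.3/0.4 for -T/-S/-B would be set
# inside the real model and are not reproduced here.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(3 * 224 * 224, 1000))

# Dummy 224x224 inputs standing in for the ImageNet-1K loader with augmentations.
train_loader = DataLoader(
    TensorDataset(torch.randn(8, 3, 224, 224), torch.randint(0, 1000, (8,))),
    batch_size=4,
)

optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)

# EMA of the weights via PyTorch's averaged-model utility.
ema_model = AveragedModel(
    model, avg_fn=lambda ema_p, p, n: EMA_DECAY * ema_p + (1.0 - EMA_DECAY) * p
)

for epoch in range(EPOCHS):
    for images, targets in train_loader:
        loss = torch.nn.functional.cross_entropy(model(images), targets)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        ema_model.update_parameters(model)
    scheduler.step()  # cosine decay of the learning rate over the training budget
```

The Kinetics-400 fine-tuning quoted above follows the same pattern with ImageNet-initialized weights, 50 epochs, 5 linear warm-up epochs, a base learning rate of 5e-4, and 8 videos per GPU; only the hyperparameter values change in the sketch.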