Twins: Revisiting the Design of Spatial Attention in Vision Transformers

Authors: Xiangxiang Chu, Zhi Tian, Yuqing Wang, Bo Zhang, Haibing Ren, Xiaolin Wei, Huaxia Xia, Chunhua Shen

NeurIPS 2021 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments show that both of our proposed architectures perform favorably against other state-of-the-art vision transformers with similar or even reduced computational complexity. We benchmark our proposed architectures on a number of visual tasks, ranging from image-level classification to pixel-level semantic/instance segmentation and object detection.
Researcher Affiliation | Collaboration | 1. Meituan Inc.; 2. The University of Adelaide, Australia
Pseudocode | Yes | The PyTorch code of LSA is given in Algorithm 1 (in the supplementary). (A hedged sketch of the idea follows the table.)
Open Source Code | Yes | Our code is available at: https://git.io/Twins.
Open Datasets | Yes | We first present the ImageNet classification results with our proposed models. We test on the ADE20K dataset [42], a challenging scene parsing task for semantic segmentation... We evaluate the performance of our method using two representative frameworks: RetinaNet [46] and Mask R-CNN [47]. Specifically, we report standard 1× schedule (12 epochs) detection results on the COCO 2017 dataset [48].
Dataset Splits | Yes | This dataset contains 20K images for training and 2K images for validation.
Hardware Specification | Yes | Throughput is tested with a batch size of 192 on a single V100 GPU.
Software Dependencies | No | The paper mentions software such as PyTorch, TensorRT, and MMDetection, but does not specify their version numbers for reproducibility.
Experiment Setup | Yes | All our models are trained for 300 epochs with a batch size of 1024 using the AdamW optimizer [37]. The learning rate is initialized to 0.001 and decayed to zero within 300 epochs following the cosine strategy. We use a linear warm-up in the first five epochs and the same regularization setting as in [2]. (A configuration sketch follows the table.)
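
For readers who want a concrete picture of the LSA pseudocode referenced above: in Twins, LSA denotes locally-grouped self-attention, i.e., standard multi-head self-attention applied independently within non-overlapping sub-windows of the feature map. The snippet below is a minimal PyTorch sketch of that idea, not the authors' Algorithm 1; the class name `LocallyGroupedSelfAttention`, the square window size `ws`, and the assumption that H and W are divisible by `ws` are illustrative choices. The released code at https://git.io/Twins is the authoritative implementation.

```python
# Hedged sketch of locally-grouped self-attention (LSA): attention is computed
# independently inside each ws x ws sub-window of the feature map.
# Class and argument names are illustrative, not the authors' Algorithm 1.
import torch
import torch.nn as nn


class LocallyGroupedSelfAttention(nn.Module):  # hypothetical name
    def __init__(self, dim, num_heads=8, ws=7):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.ws = ws                                # sub-window size (assumed square)
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence with N = H * W; H, W assumed divisible by ws
        B, N, C = x.shape
        ws = self.ws

        # Partition the feature map into (H/ws) * (W/ws) non-overlapping sub-windows.
        x = x.reshape(B, H // ws, ws, W // ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, ws * ws, C)   # (B*windows, ws*ws, C)

        # Standard multi-head self-attention within each sub-window.
        qkv = self.qkv(x).reshape(x.shape[0], ws * ws, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                      # each: (B*win, heads, ws*ws, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        x = (attn @ v).transpose(1, 2).reshape(-1, ws * ws, C)

        # Reverse the window partition back to a (B, N, C) sequence.
        x = x.reshape(B, H // ws, W // ws, ws, ws, C)
        x = x.permute(0, 1, 3, 2, 4, 5).reshape(B, N, C)
        return self.proj(x)
```

In the full Twins-SVT model, LSA blocks are interleaved with global sub-sampled attention (GSA) blocks to restore cross-window communication; that part is omitted in this sketch.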
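
The Experiment Setup row likewise translates into a small optimizer/scheduler configuration. Below is a hedged sketch of that recipe (AdamW, base learning rate 0.001, cosine decay to zero over 300 epochs, 5-epoch linear warm-up, effective batch size 1024). The placeholder model, the weight-decay value, and the per-epoch (rather than per-iteration) scheduling are assumptions not stated in the quoted passage, and the regularization settings inherited from [2] are not reproduced here.

```python
# Hedged sketch of the training recipe quoted in the Experiment Setup row:
# AdamW, base lr 1e-3, cosine decay to zero over 300 epochs, 5-epoch linear warm-up.
# The model, the weight decay, and per-epoch scheduling are assumptions.
import math
import torch

EPOCHS, WARMUP_EPOCHS, BASE_LR = 300, 5, 1e-3

model = torch.nn.Linear(192, 1000)   # placeholder standing in for a Twins model
optimizer = torch.optim.AdamW(model.parameters(), lr=BASE_LR, weight_decay=0.05)  # weight decay assumed

def lr_lambda(epoch):
    # Linear warm-up over the first 5 epochs, then cosine decay to zero.
    if epoch < WARMUP_EPOCHS:
        return (epoch + 1) / WARMUP_EPOCHS
    progress = (epoch - WARMUP_EPOCHS) / (EPOCHS - WARMUP_EPOCHS)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(EPOCHS):
    # ... one epoch of training with an effective batch size of 1024 ...
    scheduler.step()
```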