Skip-Attention: Improving Vision Transformers by Paying Less Attention
Authors: Shashanka Venkataramanan, Amir Ghodrati, Yuki M. Asano, Fatih Porikli, Amirhossein Habibian
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show that SKIPAT is agnostic to transformer architecture and is effective in image classification, semantic segmentation, image denoising, and video denoising. We achieve improved throughput at the same-or-higher accuracy levels in all these tasks. |
| Researcher Affiliation | Collaboration | Shashanka Venkataramanan, Qualcomm AI Research; Amir Ghodrati, Qualcomm AI Research; Yuki M. Asano, University of Amsterdam; Fatih Porikli, Qualcomm AI Research; Amirhossein Habibian, Qualcomm AI Research |
| Pseudocode | No | The paper describes the proposed method and its components using mathematical formulations and diagrams (e.g., Figure 4), but it does not include a structured pseudocode or algorithm block. (A hedged pseudocode-style sketch of the skipped attention computation is given after this table.) |
| Open Source Code | Yes | Code can be found at https://github.com/Qualcomm-AI-research/skip-attention |
| Open Datasets | Yes | ImageNet-1K: Image classification. We train SKIPAT on the ILSVRC-2012 dataset (Deng et al., 2009) with 1000 classes (referred to as ImageNet-1K). ... Pascal-VOC2012: Unsupervised object segmentation. We use the Pascal VOC 2012 (Everingham et al.) validation set for this experiment... ADE20K: Semantic segmentation. We evaluate SKIPAT on ADE20K (Zhou et al., 2017)... SIDD: Image denoising. We follow the experimental settings in Uformer (Wang et al., 2022b) and train SKIPAT on the Smartphone Image Denoising Dataset (SIDD) (Abdelhamed et al., 2018a)... DAVIS: Video denoising. We apply our model to the temporal task of video denoising. ... We follow the experimental settings in (Tassano et al., 2020) and train SKIPAT on the DAVIS (Pont-Tuset et al., 2017) dataset. |
| Dataset Splits | Yes | To quantify this correlation, we compute the Centered Kernel Alignment (CKA) (Kornblith et al., 2019; Cortes et al., 2012) between A^[CLS]_i and A^[CLS]_j for every i, j ≤ L across all validation samples of ImageNet-1K. ... We use the Pascal VOC 2012 (Everingham et al.) validation set for this experiment, containing 1449 images. ... The dataset includes 20K and 2K images in the training and validation set, respectively. (A hedged sketch of the CKA computation follows this table.) |
| Hardware Specification | Yes | We train baseline ViT and SKIPAT for 300 epochs from scratch on 4 NVIDIA A100 GPUs... We measure throughput (image/sec) with a batch size of 1024 on a single NVIDIA A100 GPU... We measure its inference time (averaged over 20 iterations) on a Samsung Galaxy S22 device powered by Qualcomm Snapdragon 8 Gen 1 Mobile Platform with a Qualcomm Hexagon™ processor... FLOPs and throughput are calculated on the input size of 256×256, on a single NVIDIA V100 GPU |
| Software Dependencies | No | The paper mentions using specific software such as "timm library (Wightman, 2019)" and "MMSegmentation repo (Contributors, 2020)" but does not provide specific version numbers for these software components or other key dependencies like PyTorch, TensorFlow, or CUDA. |
| Experiment Setup | Yes | We train baseline ViT and SKIPAT for 300 epochs from scratch on 4 NVIDIA A100 GPUs using batch sizes of 2048 for ViT-T and 1024 for ViT-S and ViT-B. ... We use AdamW (Loshchilov & Hutter, 2017), with an initial learning rate of 6e-5, weight decay of 1e-2, and linear warmup of 1500 iterations. All models are trained for 160K iterations with a batch size of 16. (A hedged sketch of this optimizer setup follows the table.) |
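Since the paper itself contains no algorithm block, the sketch below gives our reading of the SKIPAT idea in PyTorch: in a contiguous range of blocks the MSA computation is skipped, and its output is approximated from the previous block's MSA output by a lightweight parametric function (FC, depthwise 3×3 convolution, GELU, FC). The module name `SkipAtParametricFn`, the channel expansion factor, the CLS-token handling, and the block-level wiring are all our assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class SkipAtParametricFn(nn.Module):
    """Hedged sketch of the SKIPAT parametric function: approximates a
    block's MSA output from the previous block's MSA output with
    FC -> depthwise 3x3 conv -> GELU -> FC (widths are assumptions)."""

    def __init__(self, dim: int, grid: int, expansion: int = 2):
        super().__init__()
        self.grid = grid  # side length of the patch-token grid (assumption)
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwc = nn.Conv2d(hidden, hidden, kernel_size=3, padding=1,
                             groups=hidden)  # depthwise: cheap local mixing
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, z_msa: torch.Tensor) -> torch.Tensor:
        # z_msa: (B, N, dim) patch tokens; CLS-token handling omitted here.
        x = self.fc1(z_msa)
        b, n, c = x.shape
        x = x.transpose(1, 2).reshape(b, c, self.grid, self.grid)
        x = self.act(self.dwc(x))  # local spatial mixing instead of attention
        x = x.reshape(b, c, n).transpose(1, 2)
        return self.fc2(x)

# Block-level wiring (hypothetical; the paper skips a contiguous range of
# intermediate blocks whose attention maps are highly correlated):
#   if layer l is skipped:  msa_out = phi(prev_msa_out)   # no attention
#   else:                   msa_out = attn(norm1(tokens)); prev_msa_out = msa_out
#   tokens = tokens + msa_out
#   tokens = tokens + mlp(norm2(tokens))
```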
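The CKA measurement quoted in the Dataset Splits row can be reproduced in a few lines of NumPy. The sketch below implements the linear variant of CKA from Kornblith et al. (2019); whether the authors used the linear or a kernel variant is not stated in the quote, so the linear form is an assumption, and `attn_i`/`attn_j` (per-sample flattened [CLS] attention maps from layers i and j) are hypothetical inputs.

```python
import numpy as np

def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear CKA between two feature matrices of shape (num_samples, dim),
    e.g. flattened [CLS] attention maps from two transformer layers."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature dimension
    y = y - y.mean(axis=0, keepdims=True)
    hsic = np.linalg.norm(x.T @ y, "fro") ** 2
    return hsic / (np.linalg.norm(x.T @ x, "fro") *
                   np.linalg.norm(y.T @ y, "fro"))

# Hypothetical usage: attn_i, attn_j stacked over ImageNet-1K val samples.
# score = linear_cka(attn_i, attn_j)  # near 1.0 = highly similar layers
```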
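The semantic-segmentation recipe quoted in the Experiment Setup row (AdamW, initial LR 6e-5, weight decay 1e-2, linear warmup over 1500 iterations, 160K iterations at batch size 16) maps onto a few lines of PyTorch. The stand-in model and the constant LR after warmup are assumptions; the quote does not specify the post-warmup decay policy (MMSegmentation recipes typically use polynomial decay).

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

model = torch.nn.Linear(384, 150)  # hypothetical stand-in for the real model

optimizer = AdamW(model.parameters(), lr=6e-5, weight_decay=1e-2)

WARMUP_ITERS, TOTAL_ITERS = 1_500, 160_000

def lr_lambda(it: int) -> float:
    # Linear warmup for the first 1500 iterations; constant afterwards
    # (the post-warmup schedule is our assumption, see above).
    return min(1.0, (it + 1) / WARMUP_ITERS)

scheduler = LambdaLR(optimizer, lr_lambda)

for it in range(TOTAL_ITERS):
    # ... forward/backward on a batch of 16 goes here ...
    optimizer.step()
    optimizer.zero_grad()
    scheduler.step()
```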