Less Is More: Pay Less Attention in Vision Transformers
Authors: Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks. ... We conduct extensive experiments to show that the proposed LIT performs favorably against several state-of-the-art vision Transformers with similar or even reduced computational complexity and memory consumption. |
| Researcher Affiliation | Academia | Data Science & AI, Monash University, Australia {zizheng.pan, bohan.zhuang, haoyu.he, jing.liu1, jianfei.cai}@monash.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/zip-group/LIT. |
| Open Datasets | Yes | We conduct experiments on ImageNet (ILSVRC2012) (Russakovsky et al. 2015) dataset. ... we conduct experiments on COCO 2017 (Lin et al. 2014) dataset... We conduct experiments on ADE20K (Zhou et al. 2019)... |
| Dataset Splits | Yes | ImageNet is a large-scale dataset which has 1.2M training images from 1K categories and 50K validation images. ... COCO is a large-scale dataset which contains 118K images for the training set and 5K images for the validation set. |
| Hardware Specification | Yes | Throughput (imgs/s) is measured on one NVIDIA RTX 3090 GPU, with a batch size of 64 and averaged over 30 runs. ... All models are trained on 8 V100 GPUs, with 1× schedule (12 epochs) and a total batch size of 16. (A throughput-measurement sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'AdamW optimizer' but does not specify version numbers for any software dependencies like PyTorch or CUDA. |
| Experiment Setup | Yes | In general, all models are trained on ImageNet with 300 epochs and a total batch size of 1024. For all ImageNet experiments, training images are resized to 256 × 256, and 224 × 224 patches are randomly cropped from an image or its horizontal flip, with the per-pixel mean subtracted. We use AdamW optimizer (Loshchilov and Hutter 2019) with a cosine decay learning rate scheduler. The initial learning rate is 1e-3, and the weight decay is set to 5e-2. The initial values of learnable offsets in DTM are set to 0, and the initial learning rate for offset parameters is set to 1e-5. ... All models are trained on 8 V100 GPUs, with 1× schedule (12 epochs) and a total batch size of 16. We use AdamW (Loshchilov and Hutter 2019) optimizer with a step decay learning rate scheduler. Following PVT (Wang et al. 2021), the initial learning rates are set to 1e-4 and 2e-4 for RetinaNet and Mask R-CNN, respectively. The weight decay is set to 1e-4 for all models. (A sketch of this classification training configuration follows the table.) |
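
The Experiment Setup row lists the ImageNet classification recipe (AdamW, cosine decay, initial learning rate 1e-3, weight decay 5e-2, and a separate 1e-5 learning rate for the learnable DTM offsets). The following is a minimal PyTorch sketch of that configuration, assuming the offset parameters can be identified by the substring `offset` in their names; the actual parameter grouping is defined in the official repository and may differ.

```python
# Minimal sketch of the reported ImageNet training configuration:
# AdamW + cosine decay, lr 1e-3, weight decay 5e-2, and lr 1e-5 for the
# learnable DTM offsets. The "offset" name filter is an assumption.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def build_optimizer_and_scheduler(model: torch.nn.Module, epochs: int = 300):
    # Split parameters: DTM offset parameters get the smaller initial lr.
    offset_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (offset_params if "offset" in name else other_params).append(param)

    optimizer = AdamW(
        [
            {"params": other_params, "lr": 1e-3},
            {"params": offset_params, "lr": 1e-5},
        ],
        weight_decay=5e-2,
    )
    # Cosine decay of the learning rate over the 300-epoch schedule.
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```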
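
The Hardware Specification row states that throughput is measured on a single RTX 3090 with a batch size of 64 and averaged over 30 runs. Below is a minimal PyTorch sketch of such a measurement; the warm-up iterations and explicit CUDA synchronisation are standard practice and assumed here, since the paper does not spell them out.

```python
# Minimal throughput-measurement sketch: batch size 64, averaged over 30 runs
# on a single GPU. Warm-up and synchronisation details are assumptions.
import time

import torch


@torch.no_grad()
def measure_throughput(model, batch_size=64, runs=30, image_size=224, device="cuda"):
    model = model.to(device).eval()
    images = torch.randn(batch_size, 3, image_size, image_size, device=device)

    # Warm-up to exclude one-time CUDA initialisation costs (assumption).
    for _ in range(10):
        model(images)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(runs):
        model(images)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    return batch_size * runs / elapsed  # images per second
```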