Less Is More: Pay Less Attention in Vision Transformers
Authors: Zizheng Pan, Bohan Zhuang, Haoyu He, Jing Liu, Jianfei Cai
AAAI 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation, serving as a strong backbone for many vision tasks. ... We conduct extensive experiments to show that the proposed LIT performs favorably against several state-of-the-art vision Transformers with similar or even reduced computational complexity and memory consumption. |
| Researcher Affiliation | Academia | Data Science & AI, Monash University, Australia {zizheng.pan, bohan.zhuang, haoyu.he, jing.liu1, jianfei.cai}@monash.edu |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/zip-group/LIT. |
| Open Datasets | Yes | We conduct experiments on ImageNet (ILSVRC2012) (Russakovsky et al. 2015) dataset. ... we conduct experiments on COCO 2017 (Lin et al. 2014) dataset... We conduct experiments on ADE20K (Zhou et al. 2019)... |
| Dataset Splits | Yes | ImageNet is a large-scale dataset which has 1.2M training images from 1K categories and 50K validation images. ... COCO is a large-scale dataset which contains 118K images for the training set and 5K images for the validation set. |
| Hardware Specification | Yes | Throughput (imgs/s) is measured on one NVIDIA RTX 3090 GPU, with a batch size of 64 and averaged over 30 runs. ... All models are trained on 8 V100 GPUs, with 1× schedule (12 epochs) and a total batch size of 16. (A throughput-measurement sketch follows the table.) |
| Software Dependencies | No | The paper mentions 'AdamW optimizer' but does not specify version numbers for any software dependencies like PyTorch or CUDA. |
| Experiment Setup | Yes | In general, all models are trained on ImageNet with 300 epochs and a total batch size of 1024. For all ImageNet experiments, training images are resized to 256 × 256, and 224 × 224 patches are randomly cropped from an image or its horizontal flip, with the per-pixel mean subtracted. We use AdamW optimizer (Loshchilov and Hutter 2019) with a cosine decay learning rate scheduler. The initial learning rate is 1e-3, and the weight decay is set to 5e-2. The initial values of learnable offsets in DTM are set to 0, and the initial learning rate for offset parameters is set to 1e-5. ... All models are trained on 8 V100 GPUs, with 1× schedule (12 epochs) and a total batch size of 16. We use AdamW (Loshchilov and Hutter 2019) optimizer with a step decay learning rate scheduler. Following PVT (Wang et al. 2021), the initial learning rates are set to 1e-4 and 2e-4 for RetinaNet and Mask R-CNN, respectively. The weight decay is set to 1e-4 for all models. (A sketch of this classification training configuration follows the table.) |
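
The Experiment Setup row lists the ImageNet classification recipe (AdamW, cosine decay, initial learning rate 1e-3, weight decay 5e-2, and a separate 1e-5 learning rate for the learnable DTM offsets). The following is a minimal PyTorch sketch of that configuration, assuming the offset parameters can be identified by the substring `offset` in their names; the actual parameter grouping is defined in the official repository and may differ.

```python
# Minimal sketch of the reported ImageNet training configuration:
# AdamW + cosine decay, lr 1e-3, weight decay 5e-2, and lr 1e-5 for the
# learnable DTM offsets. The "offset" name filter is an assumption.
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR


def build_optimizer_and_scheduler(model: torch.nn.Module, epochs: int = 300):
    # Split parameters: DTM offset parameters get the smaller initial lr.
    offset_params, other_params = [], []
    for name, param in model.named_parameters():
        if not param.requires_grad:
            continue
        (offset_params if "offset" in name else other_params).append(param)

    optimizer = AdamW(
        [
            {"params": other_params, "lr": 1e-3},
            {"params": offset_params, "lr": 1e-5},
        ],
        weight_decay=5e-2,
    )
    # Cosine decay of the learning rate over the 300-epoch schedule.
    scheduler = CosineAnnealingLR(optimizer, T_max=epochs)
    return optimizer, scheduler
```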
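
The Hardware Specification row states that throughput is measured on a single RTX 3090 with a batch size of 64 and averaged over 30 runs. Below is a minimal PyTorch sketch of such a measurement; the warm-up iterations and explicit CUDA synchronisation are standard practice and assumed here, since the paper does not spell them out.

```python
# Minimal throughput-measurement sketch: batch size 64, averaged over 30 runs
# on a single GPU. Warm-up and synchronisation details are assumptions.
import time

import torch


@torch.no_grad()
def measure_throughput(model, batch_size=64, runs=30, image_size=224, device="cuda"):
    model = model.to(device).eval()
    images = torch.randn(batch_size, 3, image_size, image_size, device=device)

    # Warm-up to exclude one-time CUDA initialisation costs (assumption).
    for _ in range(10):
        model(images)
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(runs):
        model(images)
    torch.cuda.synchronize()
    elapsed = time.time() - start

    return batch_size * runs / elapsed  # images per second
```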