Adder Attention for Vision Transformer

Authors: Han Shu, Jiahao Wang, Hanting Chen, Lin Li, Yujiu Yang, Yunhe Wang

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results on several benchmarks demonstrate that the proposed approach can achieve highly competitive performance to that of the baselines while achieving about a 2~3x reduction in energy consumption.
Researcher Affiliation | Collaboration | Han Shu (1), Jiahao Wang (2), Hanting Chen (1, 3), Lin Li (4), Yujiu Yang (2), Yunhe Wang (1); 1 Huawei Noah's Ark Lab, 2 Tsinghua Shenzhen International Graduate School, 3 Peking University, 4 Huawei Technologies.
Pseudocode | No | The paper describes its methods mathematically and textually but does not include any pseudocode or clearly labeled algorithm blocks.
Open Source Code | No | The paper does not provide any statement about open-sourcing code or a link to a code repository.
Open Datasets | Yes | We first train a 6-block DeiT-Tiny [25] on the MNIST dataset. We then validate our method through the representative DeiT baselines [25] on the CIFAR-10 and CIFAR-100 datasets. We also conduct experiments on the ImageNet dataset.
Dataset Splits | No | The paper states that the CIFAR-10 (CIFAR-100) dataset is composed of 50k 32x32 training images and 10k test images, but does not specify a validation split.
Hardware Specification | Yes | We use NVIDIA Tesla-V100 GPUs and train the baseline model and the corresponding adder model for the same number of epochs using the PyTorch [22] library for a fair comparison. The experiments are conducted on NVIDIA Tesla-V100 GPUs.
Software Dependencies | No | The paper mentions the PyTorch [22] library but does not provide a version number for it or for any other software dependency.
Experiment Setup | Yes | The batch size is set as 256. For both models we use the AdamW optimizer [21] and a cosine learning rate decay policy with an initial learning rate of 0.000125. We use 5 epochs for learning rate warm-up [20] with a 0.05 weight decay rate. For all experiments, the image size is set to 224x224. Networks are trained for 600 epochs with an initial learning rate of 0.0005 and a cosine learning rate decay, again with 5 warm-up epochs [20] and a 0.05 weight decay rate.
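
The Experiment Setup row reports the optimizer, learning rate schedule, warm-up, weight decay, batch size, and image size, but no code. The following is a minimal PyTorch sketch of that configuration only, assuming a placeholder model and data loader; the adder-attention DeiT architecture itself is not reconstructed here, and function and variable names are illustrative.

```python
# Sketch of the reported training configuration: AdamW, linear warm-up for
# 5 epochs, cosine learning rate decay, 0.05 weight decay, batch size 256,
# 224x224 inputs. `model` and `train_loader` are placeholders (assumptions).
import math
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

def build_optimizer_and_scheduler(model, steps_per_epoch,
                                  epochs=600, warmup_epochs=5,
                                  base_lr=5e-4, weight_decay=0.05):
    """AdamW plus linear warm-up followed by cosine decay, per the quoted setup."""
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    warmup_steps = warmup_epochs * steps_per_epoch
    total_steps = epochs * steps_per_epoch

    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps                    # linear warm-up
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))       # cosine decay

    scheduler = LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# Example usage (placeholders; a hypothetical constructor for the model):
# model = deit_tiny_with_adder_attention()
# optimizer, scheduler = build_optimizer_and_scheduler(model, len(train_loader))
# For CIFAR experiments the paper reports an initial learning rate of 0.000125
# instead of 0.0005; pass base_lr=1.25e-4 in that case.
```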