Adder Attention for Vision Transformer
Authors: Han Shu, Jiahao Wang, Hanting Chen, Lin Li, Yujiu Yang, Yunhe Wang
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results on several benchmarks demonstrate that the proposed approach can achieve highly competitive performance to that of the baselines while achieving an about 2∼3× reduction on the energy consumption. |
| Researcher Affiliation | Collaboration | Han Shu1, Jiahao Wang2, Hanting Chen1,3, Lin Li4, Yujiu Yang2, Yunhe Wang1; 1Huawei Noah's Ark Lab, 2Tsinghua Shenzhen International Graduate School, 3Peking University, 4Huawei Technologies |
| Pseudocode | No | The paper describes its methods mathematically and textually but does not include any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | No | The paper does not provide any statement about open-sourcing code or a link to a code repository. |
| Open Datasets | Yes | We first train a 6-block DeiT-Tiny [25] on the MNIST dataset. We then validate our method through the representative DeiT baselines [25] on the CIFAR-10 and CIFAR-100 datasets. We also conduct experiments on the ImageNet dataset. |
| Dataset Splits | No | The paper mentions 'CIFAR-10 (CIFAR-100) dataset is composed of 50k different 32×32 training images and 10k test images' but does not specify details for a validation split. |
| Hardware Specification | Yes | We use NVIDIA Tesla-V100 GPUs and train the baseline model and the corresponding adder model for the same number of epochs using the PyTorch [22] library for a fair comparison. The experiments are conducted on NVIDIA Tesla-V100 GPUs. |
| Software Dependencies | No | The paper mentions the 'PyTorch [22] library' but does not provide a specific version number for it or any other software dependency. |
| Experiment Setup | Yes | The batch size is set as 256. For both models we use the AdamW optimizer [21] and a cosine learning rate decay policy with an initial learning rate of 0.000125, with 5 epochs of learning rate warm-up [20] and a 0.05 weight decay rate. For all experiments, the image size is set to 224×224. On ImageNet, networks are trained for 600 epochs with an initial learning rate of 0.0005 and a cosine learning rate decay, again with 5 warm-up epochs [20] and a 0.05 weight decay rate. |
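
The training configuration quoted in the Experiment Setup row (AdamW, cosine learning rate decay, 5 warm-up epochs, 0.05 weight decay, batch size 256, 224×224 inputs) can be expressed as a minimal PyTorch sketch like the one below. This is an assumed reconstruction for illustration only, not the authors' released code: the `build_optimizer_and_scheduler` helper and the stand-in model are hypothetical, and the base learning rate defaults to the ImageNet value (0.0005); for the CIFAR runs the paper quotes 0.000125 instead.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR, LinearLR, SequentialLR


def build_optimizer_and_scheduler(model, epochs=600, warmup_epochs=5,
                                  base_lr=5e-4, weight_decay=0.05):
    """AdamW with linear warm-up followed by cosine decay, per the quoted setup.

    Hypothetical helper; hyperparameter defaults follow the ImageNet setting
    described in the paper (600 epochs, lr 0.0005, 5 warm-up epochs, wd 0.05).
    """
    optimizer = AdamW(model.parameters(), lr=base_lr, weight_decay=weight_decay)
    # 5-epoch warm-up, then cosine decay over the remaining epochs (one step per epoch).
    warmup = LinearLR(optimizer, start_factor=1e-3, total_iters=warmup_epochs)
    cosine = CosineAnnealingLR(optimizer, T_max=epochs - warmup_epochs)
    scheduler = SequentialLR(optimizer, schedulers=[warmup, cosine],
                             milestones=[warmup_epochs])
    return optimizer, scheduler


if __name__ == "__main__":
    # Placeholder stand-in for a DeiT-style backbone; the paper uses 224x224 inputs
    # and a batch size of 256.
    model = torch.nn.Linear(3 * 224 * 224, 1000)
    optimizer, scheduler = build_optimizer_and_scheduler(model)
    for epoch in range(600):
        # ... one training epoch over the data loader would go here ...
        scheduler.step()
```

The scheduler is stepped once per epoch here; whether the original training loop stepped per epoch or per iteration is not specified in the paper, so this is an assumption of the sketch.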