BVT-IMA: Binary Vision Transformer with Information-Modified Attention
Authors: Zhenyu Wang, Hao Luo, Xuemei Xie, Fan Wang, Guangming Shi
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on CIFAR-100/Tiny ImageNet/ImageNet-1k demonstrate the effectiveness of the proposed information-modified attention on binary vision transformers. |
| Researcher Affiliation | Collaboration | (1) Hangzhou Institute of Technology, Xidian University, Hangzhou 311200, China; (2) Guangzhou Institute of Technology, Xidian University, Guangzhou 510700, China; (3) Pazhou Lab, Huangpu 510555, China; (4) DAMO Academy, Alibaba Group, Hangzhou 310023, China; (5) Hupan Lab, Hangzhou 310023, China. zywang1995@outlook.com, {michuan.lh, fan.w}@alibaba-inc.com, xmxie@mail.xidian.edu.cn, gmshi@xidian.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code will be uploaded to https://github.com/Daner-Wang/BVT-IMA.git. |
| Open Datasets | Yes | The proposed method is applied to popular ViT models (DeiT (Touvron et al. 2021), Swin (Liu et al. 2021a), and NesT (Zhang et al. 2022c)) and evaluated on the CIFAR-100 (Krizhevsky and Hinton 2009)/Tiny ImageNet (Pouransari and Ghili 2014)/ImageNet-1k (Russakovsky et al. 2015) benchmarks of 100/200/1000 classes. |
| Dataset Splits | No | The paper refers to the standard benchmarks but does not explicitly state the train/validation/test splits (e.g., percentages or sample counts per split) used in the experiments. |
| Hardware Specification | Yes | All experiments are implemented with PyTorch (Paszke et al. 2019) and the TIMM library on NVIDIA V100 GPUs. |
| Software Dependencies | No | The paper mentions PyTorch (Paszke et al. 2019) and the TIMM library but does not provide version numbers for either component. |
| Experiment Setup | Yes | Two-stage training is adopted, in which only weights are binarized in the first stage. Activation binarization and information tables are adopted in the second stage. The Adam optimizer without weight decay is employed. A cosine annealing schedule with 5 epochs of warm-up is applied to adjust the learning rate, which is initialized to 5e-4. For ImageNet-1k, a two-step binarization scheme is adopted during warm-up to help models converge on the complex dataset. Knowledge distillation is adopted for each quantized model to learn from its corresponding real-valued teacher with the cross-entropy loss function and a 0.5 distillation factor. Models are trained for 300/150/150 epochs in each stage on CIFAR-100/Tiny ImageNet/ImageNet-1k. Limited by GPU memory, the batch size is 128 for DeiT-Tiny in both stages, and 128/64 for the other models in the first/second stage. The data augmentation and other hyper-parameters are the same as those in DeiT. |
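
The Experiment Setup row above describes the optimizer, learning-rate schedule, and distillation choices in prose only. The snippet below is a minimal PyTorch/TIMM sketch of that configuration, not the authors' released code: the warm-up starting learning rate, per-epoch scheduler stepping, and the hard-label form of distillation are assumptions that the paper does not specify.

```python
# Minimal sketch of the reported optimizer/schedule/distillation setup (assumptions noted inline).
import torch
import torch.nn as nn
import torch.nn.functional as F
from timm.scheduler import CosineLRScheduler


def build_optimizer_and_scheduler(model: nn.Module, epochs: int):
    """Adam without weight decay, lr 5e-4, cosine annealing with 5 warm-up epochs."""
    optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, weight_decay=0.0)
    scheduler = CosineLRScheduler(
        optimizer,
        t_initial=epochs,     # cosine period; per-epoch stepping is an assumption
        warmup_t=5,           # 5 warm-up epochs, as reported
        warmup_lr_init=1e-6,  # warm-up starting lr is not reported; illustrative value
    )
    return optimizer, scheduler


def distillation_loss(student_logits, teacher_logits, labels, alpha: float = 0.5):
    """Cross-entropy on ground-truth labels mixed with cross-entropy on the
    real-valued teacher's hard predictions, using the reported 0.5 factor.
    Hard-label (DeiT-style) distillation is an assumption; the paper only
    states cross-entropy with a 0.5 distillation factor."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.cross_entropy(student_logits, teacher_logits.argmax(dim=-1))
    return (1.0 - alpha) * ce + alpha * kd
```

Under the two-stage scheme described in the row, this construction would presumably be repeated per stage with the stage-specific epoch budget (300/150/150) and batch size (128, or 64 in the second stage for models other than DeiT-Tiny).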