Fast Vision Transformers with HiLo Attention

Authors: Zizheng Pan, Jianfei Cai, Bohan Zhuang

NeurIPS 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct image classification experiments on ImageNet-1K [43], a large-scale image dataset which contains 1.2M training images and 50K validation images from 1K categories. We measure the model performance by Top-1 accuracy. Furthermore, we report the FLOPs, throughput, as well as training/test memory consumption on GPUs.
Researcher Affiliation | Academia | Zizheng Pan, Jianfei Cai, Bohan Zhuang; Department of Data Science & AI, Monash University, Australia
Pseudocode | No | The paper does not contain any clearly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Code is available at https://github.com/ziplab/LITv2.
Open Datasets | Yes | We conduct image classification experiments on ImageNet-1K [43], a large-scale image dataset which contains 1.2M training images and 50K validation images from 1K categories.
Dataset Splits | Yes | We conduct image classification experiments on ImageNet-1K [43], a large-scale image dataset which contains 1.2M training images and 50K validation images from 1K categories.
Hardware Specification | Yes | Throughput is tested on one NVIDIA RTX 3090 GPU and averaged over 30 runs (Table 1 footnote). Evaluations are based on a batch size of 64 on one RTX 3090 GPU (Figure 3 caption). Intel Core i9-10900X CPU @ 3.70GHz and NVIDIA GeForce RTX 3090 (Figure 6). A throughput-measurement sketch based on this protocol follows the table.
Software Dependencies | No | The paper mentions the 'mmdetection [4] framework' but does not provide specific version numbers for any software dependencies, libraries, or programming languages used for reproducibility.
Experiment Setup | Yes | All models are trained for 300 epochs from scratch on 8 V100 GPUs. At training time, we set the total batch size as 1,024. The input images are resized and randomly cropped into 224 × 224. The initial learning rate is set to 1 × 10^-3 and the weight decay is set to 5 × 10^-2. We use the AdamW optimizer with a cosine decay learning rate scheduler. A configuration sketch based on this recipe is given below.
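
The reported setup maps onto a standard PyTorch recipe. Below is a minimal sketch of the quoted configuration, assuming AdamW with cosine decay as described; the toy model and synthetic batch are placeholders for LITv2 and the ImageNet-1K dataloader, and only the hyperparameters come from the paper.

```python
# Minimal sketch of the reported training recipe (assumptions: plain PyTorch,
# a toy stand-in model, and one synthetic batch per epoch instead of ImageNet-1K).
import torch
import torch.nn as nn
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

EPOCHS = 300         # trained for 300 epochs from scratch
BATCH_SIZE = 1024    # total batch size across 8 V100 GPUs
BASE_LR = 1e-3       # initial learning rate
WEIGHT_DECAY = 5e-2  # weight decay
IMG_SIZE = 224       # images resized and randomly cropped to 224 x 224
NUM_CLASSES = 1000   # ImageNet-1K categories

# Toy classifier standing in for LITv2 (not the authors' architecture).
model = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(3, NUM_CLASSES))
optimizer = AdamW(model.parameters(), lr=BASE_LR, weight_decay=WEIGHT_DECAY)
scheduler = CosineAnnealingLR(optimizer, T_max=EPOCHS)  # cosine decay over training

for epoch in range(EPOCHS):
    # A real run iterates the ImageNet-1K loader here; one random batch keeps
    # the sketch self-contained.
    images = torch.randn(BATCH_SIZE, 3, IMG_SIZE, IMG_SIZE)
    targets = torch.randint(0, NUM_CLASSES, (BATCH_SIZE,))
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(images), targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```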
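
The throughput protocol quoted under Hardware Specification (a batch size of 64 on one RTX 3090, averaged over 30 runs) can be approximated as follows. This is a sketch assuming a CUDA-capable GPU, with torchvision's resnet50 as a stand-in for LITv2 rather than the authors' benchmarking code.

```python
# Sketch of a throughput measurement: batch size 64, warm-up, then the mean
# over 30 timed forward passes (assumes a CUDA GPU; resnet50 is a placeholder).
import torch
import torchvision

model = torchvision.models.resnet50().cuda().eval()
images = torch.randn(64, 3, 224, 224, device="cuda")  # batch size of 64

with torch.no_grad():
    for _ in range(10):  # warm-up iterations before timing
        model(images)
    torch.cuda.synchronize()

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    runs = 30            # averaged over 30 runs
    start.record()
    for _ in range(runs):
        model(images)
    end.record()
    torch.cuda.synchronize()

elapsed_s = start.elapsed_time(end) / 1000.0  # elapsed_time returns milliseconds
print(f"throughput: {64 * runs / elapsed_s:.1f} images/s")
```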