QKFormer: Hierarchical Spiking Transformer using Q-K Attention

Authors: Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, Yonghong Tian

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | QKFormer achieves significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, at a size comparable to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1K, outperforming Spikformer by 10.84 percentage points. To the best of our knowledge, this is the first time directly trained SNNs have exceeded 85% accuracy on ImageNet-1K.
Researcher Affiliation | Collaboration | Pengcheng Laboratory; Harbin Institute of Technology; Peking University
Pseudocode | No | The paper does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code and models are available at https://github.com/zhouchenlin2096/QKFormer.
Open Datasets | Yes | We evaluate QKFormer on static image classification and neuromorphic classification. The former includes ImageNet-1K [39] and CIFAR10/100 [40]; the latter contains CIFAR10-DVS [41] and DVS128 Gesture [42]. (A dataset-loading sketch follows the table.)
Dataset Splits | Yes | ImageNet-1K contains 1.28 million images for training and 50k images for validation, spanning 1,000 categories.
Hardware Specification | Yes | We use 8 NVIDIA Tesla V100 SXM2 32GB GPUs when training models on ImageNet, while 1 GPU is used to train on the other datasets (CIFAR10, CIFAR100, DVS128 Gesture, CIFAR10-DVS).
Software Dependencies | No | All experiments are implemented based on PyTorch [53], SpikingJelly [54], and timm [55]. Specific version numbers for these libraries are not provided. (See the version-check sketch below the table.)
Experiment Setup | Yes | In this experiment, AdamW is used as the optimizer with a base learning rate of 6 × 10⁻⁴; the actual learning rate is Batch Size / 256 times the base learning rate. The batch size is set to 512, realized via accumulated gradient iterations [33] and distributed across 8 NVIDIA V100 GPUs. QKFormer is trained for 200 epochs. In addition, following DeiT [32], data augmentation techniques including RandAugment [34], random erasing [35], and stochastic depth [36] are employed. The number of blocks in the three stages is set as {1, 2, 7}, respectively. (A training-recipe sketch follows the table.)
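
The neuromorphic datasets in the Open Datasets row are distributed through SpikingJelly's dataset API. A minimal loading sketch for CIFAR10-DVS follows; the local path, the frame count of 16 time steps, and the `split_by='number'` choice are illustrative assumptions, not settings confirmed by the paper.

```python
# Hedged sketch: loading CIFAR10-DVS with SpikingJelly's dataset API.
from spikingjelly.datasets.cifar10_dvs import CIFAR10DVS

dataset = CIFAR10DVS(
    root='./data/cifar10_dvs',  # hypothetical local path
    data_type='frame',          # integrate raw events into fixed-length frames
    frames_number=16,           # assumed number of time steps, not from the paper
    split_by='number',          # split the event stream into frames by event count
)
# CIFAR10-DVS ships 10,000 samples with no official train/test split,
# so any split ratio is a reproduction-time decision.
print(len(dataset))
```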
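Since the Software Dependencies row flags missing version numbers, a reproduction would start by recording the environment. A small sketch, assuming the three named libraries are installed under their usual PyPI names:

```python
# Record the versions of the paper's three named dependencies; the paper
# itself pins no versions, so whatever prints here reflects local state only.
from importlib.metadata import version

for pkg in ("torch", "spikingjelly", "timm"):
    print(pkg, version(pkg))
```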
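The Experiment Setup row translates directly into an optimizer configuration. Below is a minimal sketch of the linear learning-rate scaling rule and the gradient accumulation it describes; the toy model, loader, and the micro-batch split of 512 into 4 × 128 are placeholders, not values taken from the paper.

```python
import torch
from torch import nn

# Placeholder model and data so the sketch runs standalone; the real QKFormer
# model and ImageNet loader come from the authors' repository.
model = nn.Linear(8, 10)
criterion = nn.CrossEntropyLoss()
train_loader = [(torch.randn(128, 8), torch.randint(0, 10, (128,))) for _ in range(8)]

base_lr = 6e-4
effective_batch = 512                    # effective batch size from the paper
lr = base_lr * effective_batch / 256     # linear scaling rule quoted above
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

accum_steps = 4                          # assumed: 512 = 4 micro-batches of 128
epochs = 200                             # as stated in the report

for epoch in range(epochs):
    optimizer.zero_grad()
    for step, (images, targets) in enumerate(train_loader):
        loss = criterion(model(images), targets) / accum_steps
        loss.backward()                  # gradients accumulate across micro-batches
        if (step + 1) % accum_steps == 0:
            optimizer.step()             # one update per effective batch of 512
            optimizer.zero_grad()
```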