Spiking Transformer with Experts Mixture
Authors: Zhaokun Zhou, Yijie Lu, Yanhao Jia, Kaiwei Che, Jun Niu, Liwei Huang, Xinyu Shi, Yuesheng Zhu, Guoqi Li, Zhaofei Yu, Li Yuan
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that SEMM realizes sparse conditional computation and obtains a stable improvement on neuromorphic and static datasets with approximately the same computational overhead as the Spiking Transformer baselines. |
| Researcher Affiliation | Academia | (1) School of Electronic and Computer Engineering, Shenzhen Graduate School, Peking University; (2) Peng Cheng Laboratory; (3) College of Computing and Data Science, Nanyang Technological University; (4) School of Computer Science, Peking University; (5) Institute for Artificial Intelligence, Peking University; (6) Institute of Automation, Chinese Academy of Sciences; (7) Deep Neuro Cognition Lab, I2R and CFAR, Agency for Science, Technology and Research |
| Pseudocode | No | The paper describes its methods using equations and text but does not include a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material? Answer: [Yes] Justification: Please see the codes in the supplementary material. |
| Open Datasets | Yes | We conduct experiments on static datasets, i.e., ImageNet [40] and CIFAR [41], and neuromorphic datasets, i.e., CIFAR10-DVS [42], DVS128 Gesture [43], to verify the effectiveness of SEMM. |
| Dataset Splits | Yes | We analyze the average spiking rate (ASR) of routers for EMSA and EMSP on the ImageNet validation set, which is shown in Tab. 2. |
| Hardware Specification | Yes | In our experiments, we use 8 NVIDIA-4090 GPUs for ImageNet, and 1 NVIDIA-4090 GPU for other datasets. |
| Software Dependencies | No | The paper mentions using a Sigmoid function as the surrogate function and other components like AdamW, but does not provide specific version numbers for software libraries or frameworks (e.g., PyTorch version, CUDA version). |
| Experiment Setup | Yes | For ImageNet-1K, we utilize a fixed timestep count of T = 4. The optimizer is AdamW, with a batch size of 128 or 256 over the course of 310 training epochs. The learning rate is governed by a cosine-decay schedule, starting from an initial value of 0.0004. We incorporate a suite of standard data augmentation techniques, including random augmentation, mixup, and cutmix, into our training. For the four small datasets, we adapt the SEMM to a variety of baseline models, following the precedents set by [11, 12]. For CIFAR, we maintain a timestep count of T = 4. For the neuromorphic datasets, we increase this to T = 10 and T = 16, respectively. Our experimental setup is consistent with each Spiking Transformer baseline, as detailed below. |
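The Software Dependencies row notes that a Sigmoid function serves as the surrogate gradient for the spiking neurons, without pinning down library versions. The sketch below shows a generic sigmoid surrogate in PyTorch; the class name and the slope constant `alpha` are illustrative assumptions, not values taken from the paper or its supplementary code.

```python
import torch


class SigmoidSurrogate(torch.autograd.Function):
    """Heaviside spike in the forward pass, sigmoid derivative in the backward pass."""

    @staticmethod
    def forward(ctx, membrane_potential, alpha=4.0):  # alpha is an assumed slope, not from the paper
        ctx.save_for_backward(membrane_potential)
        ctx.alpha = alpha
        # Fire a binary spike wherever the membrane potential reaches the threshold (here 0).
        return (membrane_potential >= 0.0).float()

    @staticmethod
    def backward(ctx, grad_output):
        (membrane_potential,) = ctx.saved_tensors
        sig = torch.sigmoid(ctx.alpha * membrane_potential)
        # Replace the non-differentiable Heaviside with the sigmoid's derivative:
        # d/du sigmoid(alpha * u) = alpha * sigmoid(alpha * u) * (1 - sigmoid(alpha * u)).
        return grad_output * ctx.alpha * sig * (1.0 - sig), None
```

In use, spikes are produced with `SigmoidSurrogate.apply(membrane_potential)`, so the forward pass stays binary while gradients flow through the smooth sigmoid approximation.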
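For the ImageNet-1K setup in the last row, a minimal optimizer-and-schedule sketch is given below, assuming PyTorch and timm. The model placeholder and the mixup/cutmix alpha values are illustrative defaults; the paper only states the optimizer, initial learning rate, batch size, epoch count, and the augmentation techniques by name.

```python
import torch
from timm.data import Mixup  # mixup / cutmix augmentation

# Placeholder model; the actual SEMM architecture is provided in the
# authors' supplementary code.
model = torch.nn.Linear(3 * 224 * 224, 1000)

# AdamW with the initial learning rate of 4e-4 reported in the paper.
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-4)

# Cosine-decay schedule over the 310 training epochs.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=310)

# Mixup/cutmix; the alpha values are common ImageNet defaults, not values
# reported in the paper.
mixup_fn = Mixup(mixup_alpha=0.8, cutmix_alpha=1.0, num_classes=1000)
```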