SpikeLM: Towards General Spike-Driven Language Modeling via Elastic Bi-Spiking Mechanisms

Authors: Xingrun Xing, Zheng Zhang, Ziyi Ni, Shitao Xiao, Yiming Ju, Siqi Fan, Yequan Wang, Jiajun Zhang, Guoqi Li

ICML 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate previous SNNs and SpikeLMs on a range of general language tasks, including discriminative and generative ones.
Researcher Affiliation | Academia | (1) Institute of Automation, Chinese Academy of Sciences; (2) School of Artificial Intelligence, University of Chinese Academy of Sciences; (3) Beijing Academy of Artificial Intelligence.
Pseudocode | No | No structured pseudocode or algorithm blocks are present.
Open Source Code | Yes | Our code is available at https://github.com/XingrunXing/SpikeLM.
Open Datasets | Yes | In pretraining, we use the BooksCorpus (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2018) as training data, including 800M and 2500M words respectively. In finetuning, we use the GLUE benchmark, training with the common settings of ANNs.
Dataset Splits | Yes | We follow the standard ANN-based BERT to develop SNN-based LIF-BERT and SpikeLM, which include two stages: pretraining and finetuning. In pretraining, we use the BooksCorpus (Zhu et al., 2015) and English Wikipedia (Devlin et al., 2018) as training data, including 800M and 2500M words respectively. In finetuning, we use the GLUE benchmark, training with the common settings of ANNs.
Hardware Specification | Yes | All SNN models are trained on a single node with 8 A800 GPUs.
Software Dependencies | No | The paper mentions PyTorch and SpikingJelly but does not provide specific version numbers for these software dependencies.
Experiment Setup | Yes | We set the maximum length of each sentence as 128 tokens. The batch size is set to 512 in training. The entire pretraining encompasses a total of 10^5 steps. The same as ANN conditions, we train SNNs with an AdamW optimizer with a 2 * 10^-4 peak learning rate and 0.01 weight decay. We adapt the learning rate by a linear schedule with 5000 warm-up steps. ... we maintain a constant learning rate of 2 * 10^-5 and a batch size of 32 for all subsets. ... For XSUM, CNN-DailyMail, and WMT16 datasets, we use the AdamW optimizer and train 20 epochs with a 128 batch size, and a peak learning rate of 3.5 * 10^-4, 7 * 10^-4, or 1 * 10^-4 respectively.
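
For readers checking the data setup listed under "Open Datasets" and "Dataset Splits", the sketch below loads the named corpora with the Hugging Face `datasets` library. The library choice, the dataset identifiers, and the SST-2 subset are illustrative assumptions; the paper only names BooksCorpus, English Wikipedia, and the GLUE benchmark, and does not describe its data-loading code.

```python
# Hypothetical data-loading sketch for the corpora named in the paper
# (BooksCorpus, English Wikipedia, GLUE). The Hugging Face dataset
# identifiers below are assumptions, not taken from the authors' code.
from datasets import load_dataset

# Pretraining corpora (~800M and ~2500M words, per the paper)
bookcorpus = load_dataset("bookcorpus", split="train")
wikipedia = load_dataset("wikipedia", "20220301.en", split="train")

# Finetuning: one GLUE subset as an example (SST-2 chosen arbitrarily)
glue_sst2 = load_dataset("glue", "sst2")

print(bookcorpus)
print(wikipedia)
print(glue_sst2)
```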
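
Similarly, the pretraining hyperparameters quoted in the "Experiment Setup" row can be summarized as a minimal PyTorch sketch. It assumes the linear schedule decays to zero after warm-up (the quoted text does not state the decay target) and uses a placeholder module in place of LIF-BERT / SpikeLM; it is not the authors' training code.

```python
# Minimal sketch of the reported pretraining optimization setup (not the authors' code).
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import LambdaLR

MAX_SEQ_LEN = 128      # maximum sentence length in tokens
BATCH_SIZE = 512       # pretraining batch size
TOTAL_STEPS = 100_000  # 1e5 pretraining steps
WARMUP_STEPS = 5_000   # linear warm-up steps
PEAK_LR = 2e-4         # peak learning rate
WEIGHT_DECAY = 0.01    # AdamW weight decay

# Placeholder standing in for the SNN language model (LIF-BERT / SpikeLM).
model = torch.nn.Linear(768, 768)

optimizer = AdamW(model.parameters(), lr=PEAK_LR, weight_decay=WEIGHT_DECAY)

def linear_warmup_decay(step: int) -> float:
    """Linear warm-up to the peak LR, then (assumed) linear decay to zero."""
    if step < WARMUP_STEPS:
        return step / max(1, WARMUP_STEPS)
    return max(0.0, (TOTAL_STEPS - step) / max(1, TOTAL_STEPS - WARMUP_STEPS))

scheduler = LambdaLR(optimizer, lr_lambda=linear_warmup_decay)

for step in range(TOTAL_STEPS):
    # batch = next(data_iterator)   # (BATCH_SIZE, MAX_SEQ_LEN) token ids in practice
    # loss = model(batch)...        # forward pass and loss computation omitted
    # loss.backward()
    optimizer.step()
    scheduler.step()
    optimizer.zero_grad()
```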