Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
SpikeGPT: Generative Pre-trained Language Model with Spiking Neural Networks
Authors: Rui-Jie Zhu, Qihang Zhao, Guoqi Li, Jason Eshraghian
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that SpikeGPT remains competitive with non-spiking models on tested benchmarks, while maintaining 32.2× fewer operations when processed on neuromorphic hardware that can leverage sparse, event-driven activations. Our experiments show that SpikeGPT achieves competitive performance on all tested datasets while consuming significantly less energy compared to traditional ANN models. We test two variants of the 46 million parameter model; one where T = 1,024 and another where T = 3,072. We used the Enwik8 dataset to conduct both training and testing in 46M scale, and our most extensive model with 216 million parameters was trained using the OpenWebText2 (Gao et al., 2020) corpus for pre-training. We evaluated SpikeGPT on two major language-related tasks: Natural Language Generation (NLG) and Natural Language Understanding (NLU). |
| Researcher Affiliation | Collaboration | Rui-Jie Zhu, Department of Electrical and Computer Engineering, University of California, Santa Cruz; Qihang Zhao, Kuaishou; Guoqi Li, Institute of Automation, Chinese Academy of Sciences; Jason K. Eshraghian, Department of Electrical and Computer Engineering, University of California, Santa Cruz |
| Pseudocode | No | The paper describes the model architecture and components using mathematical equations and diagrams (Figure 1, equations 1-5, 7-10) but does not include any explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | Our code implementation is available at https://github.com/ridgerchu/SpikeGPT. |
| Open Datasets | Yes | We used the Enwik8 dataset to conduct both training and testing in 46M scale, and our most extensive model with 216 million parameters was trained using the OpenWebText2 (Gao et al., 2020) corpus for pre-training. We evaluated the text generation performance of SpikeGPT on three classic text generation datasets: Enwik8 (Mahoney, 2011), WikiText-2 (Merity et al., 2017), and WikiText-103 (Merity et al., 2017). For NLU tasks, we chose the following 4 classic text classification datasets to evaluate the performance of our proposed SpikeGPT: MR (Pang & Lee, 2005), SST-5 (Socher et al., 2013), SST-2 (Socher et al., 2013), Subj. (Pang & Lee, 2004) |
| Dataset Splits | Yes | We split the tokens into three subsets: 90% for training, 5% for validation, and 5% for testing. If there is no standard training test segmentation, we will follow Lv et al. (2023b) and randomly select 10% of the samples from the entire dataset as the test set. |
| Hardware Specification | Yes | All experiments were conducted on four NVIDIA V100 graphics cards. |
| Software Dependencies | No | Our implementation is based on PyTorch (Paszke et al., 2019) and SpikingJelly (Fang et al., 2020). While PyTorch and SpikingJelly are mentioned, specific version numbers for these software dependencies are not provided in the text. |
| Experiment Setup | Yes | To mitigate the issue of overfitting, we incorporate dropout after the output of each SRFFN block and set the dropout ratio to 0.03. We employ the Byte Pair Encoding (BPE) tokenizer and share the same hyper-parameters as GPT-NeoX (Black et al., 2022). To facilitate better convergence, we utilize a warmup technique during the first 500 training steps. For both the 46M and 216M models, we use the Adam optimizer, and set the learning rate to 6 × 10−4 and 4 × 10−4, respectively. For the models of 46M and 216M, we trained them for 12 and 48 hours respectively. |
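The 90%/5%/5% token split quoted in the Dataset Splits row can be sketched as a few lines of Python. This is an illustrative reconstruction, not the authors' code; the function name and contiguous-slice strategy are assumptions.

```python
# Hypothetical sketch of the 90/5/5 train/validation/test split
# described in the "Dataset Splits" row. A contiguous slice is assumed,
# which is typical for language-modeling token streams.
def split_tokens(tokens, train_frac=0.90, val_frac=0.05):
    """Split a token sequence into train/validation/test subsets."""
    n = len(tokens)
    train_end = int(n * train_frac)
    val_end = train_end + int(n * val_frac)
    return tokens[:train_end], tokens[train_end:val_end], tokens[val_end:]

train, val, test = split_tokens(list(range(1000)))
```

For datasets without a standard split, the report notes that 10% of samples are instead randomly held out as a test set, following Lv et al. (2023b).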
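The Experiment Setup row mentions a 500-step warmup on top of base learning rates of 6 × 10⁻⁴ (46M) and 4 × 10⁻⁴ (216M). A minimal sketch of such a schedule, assuming a linear ramp (the exact warmup shape is not stated in the quoted text):

```python
# Hypothetical linear-warmup schedule for the setup quoted above.
# The linear ramp and the constant rate after warmup are assumptions;
# only the base rates and the 500-step warmup come from the report.
BASE_LR = 6e-4       # 46M model (the 216M model uses 4e-4)
WARMUP_STEPS = 500

def lr_at_step(step, base_lr=BASE_LR, warmup=WARMUP_STEPS):
    """Ramp the learning rate linearly during warmup, then hold it."""
    if step < warmup:
        return base_lr * (step + 1) / warmup
    return base_lr
```

In practice this would be wired into the Adam optimizer the report mentions, e.g. via a per-step scheduler callback.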