Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Spark Transformer: Reactivating Sparsity in Transformer FFN and Attention
Authors: Chong You, Kan Wu, Zhipeng Jia, Lin Chen, Srinadh Bhojanapalli, Jiaxian Guo, Utku Evci, Jan Wassenberg, Praneeth Netrapalli, Jeremiah Willcock, Suvinay Subramanian, Felix Chern, Alek Andreev, Shreya Pathak, Felix Yu, Prateek Jain, David Culler, Henry Levy, Sanjiv Kumar
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Pretrained with the Gemma-2 recipe, Spark Transformer demonstrates competitive performance on standard benchmarks while exhibiting significant sparsity: only 8% of FFN neurons are activated, and each token attends to a maximum of 256 tokens. This translates to a 2.5× reduction in FLOPs, leading to decoding wall-time speedups of up to 1.79× on CPU and 1.40× on GPU. In this section, we present an experimental evaluation of Spark Transformer using the Gemma-2 2B model. |
| Researcher Affiliation | Industry | Equal contribution. Now at xAI. Now at Anthropic. (Referring to the gemma.cpp tool) Google/gemma.cpp: lightweight, standalone c++ inference engine for google's gemma models. https://github.com/google/gemma.cpp, 2025. |
| Pseudocode | No | The paper describes the methodology using mathematical equations and textual explanations (e.g., Section 2: Spark Transformer, Section 3: Statistical Top-k) and diagrams (e.g., Figure 2), but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | For CPU evaluation, we use gemma.cpp [33], the official C++ inference engine optimized for CPUs. For GPU evaluation, we use llama.cpp [31], a widely-used LLM inference engine which supports running a wide selection of LLM models on GPUs. We modify both frameworks to support sparse matrix multiplication operators... (From NeurIPS Checklist): Training data and code for Gemma-2 are proprietary. |
| Open Datasets | No | Pretrained with the Gemma-2 recipe... We evaluate Spark Transformer on a suite of benchmarks that are used in the Gemma-2 paper [30]... (From NeurIPS Checklist): Training data and code for Gemma-2 are proprietary. The paper does not provide concrete access information for datasets. |
| Dataset Splits | No | Pretrained with the Gemma-2 recipe... Gemma-2 2B is a decoder-only Transformer with 2 billion parameters, pretrained on 2 trillion tokens of primarily English text data (see [30] for details). The paper refers to the 'Gemma-2 recipe' for training but does not explicitly provide details about training, validation, or test dataset splits. |
| Hardware Specification | Yes | For CPU evaluation, we use gemma.cpp [33], the official C++ inference engine optimized for CPUs... Figure 5a and 5b report the decoding speed under varying prompt lengths on a 4-Core or a 16-Core CPU... Figure 5c reports the decoding speed under varying prompt lengths on an NVIDIA T4 GPU. |
| Software Dependencies | No | For CPU evaluation, we use gemma.cpp [33], the official C++ inference engine optimized for CPUs. For GPU evaluation, we use llama.cpp [31], a widely-used LLM inference engine... high-precision piecewise approximation algorithms with constant complexity are available in standard software packages like SciPy [86]. The paper mentions software components but does not provide specific version numbers. |
| Experiment Setup | Yes | Implementation details. Gemma-2 uses a model dimension of d_model = 2304. For FFN, Gemma-2 uses the Gated FFN in eq. (5) with d_ff = 9216. We replace it with Spark FFN in eq. (2) with d_ff = 13824 so that the parameter count stays the same. In addition, we take k to be 1106, which gives a sparsity level of 8%, and r = 1024 ≈ d_model/2 (due to sharding constraints, r can only be a multiple of 256). For Attention, Gemma-2 alternates between a global attention that has a span of 8192 tokens, and a local attention with a 4096 window size, both with d_attn = 256. We replace both with Spark Attention in eq. (6) where for the latter we use the same 4096 window size. For hyper-parameters, we use k = 256, i.e. each token attends to at most 256 tokens, and r = 128 = d_attn/2. |
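
The figures quoted in the Experiment Setup row can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not code from the paper. The numeric constants are taken from the row above, but the matrix counts are assumptions: it supposes the Gated FFN holds three d_model × d_ff weight matrices and the Spark FFN holds two, since eqs. (2) and (5) are not reproduced on this page. The function and variable names are hypothetical.

```python
# Values quoted in the Experiment Setup row above.
D_MODEL = 2304      # model dimension
D_FF_GEMMA = 9216   # Gated FFN hidden width in Gemma-2
D_FF_SPARK = 13824  # Spark FFN hidden width
K_FFN = 1106        # number of FFN neurons kept active
R_FFN = 1024        # FFN hyper-parameter r
D_ATTN = 256        # attention head dimension
R_ATTN = 128        # attention hyper-parameter r

# ASSUMPTION (not stated on this page): weight-matrix counts per FFN variant.
GATED_FFN_MATRICES = 3
SPARK_FFN_MATRICES = 2


def ffn_params(d_model: int, d_ff: int, n_matrices: int) -> int:
    """Parameter count of an FFN made of n_matrices (d_model x d_ff) matrices."""
    return n_matrices * d_model * d_ff


def summarize() -> dict:
    """Collect the quantities that the quoted setup text asserts."""
    return {
        "gated_ffn_params": ffn_params(D_MODEL, D_FF_GEMMA, GATED_FFN_MATRICES),
        "spark_ffn_params": ffn_params(D_MODEL, D_FF_SPARK, SPARK_FFN_MATRICES),
        "ffn_active_fraction": K_FFN / D_FF_SPARK,
        "r_ffn_multiple_of_256": R_FFN % 256 == 0,
        "r_ffn_vs_half_d_model": (R_FFN, D_MODEL // 2),
        "r_attn_is_half_d_attn": 2 * R_ATTN == D_ATTN,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}")
```

Under these assumptions the two FFN parameter counts coincide, and k / d_ff rounds to the quoted 8% sparsity level. The value d_model/2 = 1152 is not a multiple of 256, which is consistent with the row's remark that r is only approximately d_model/2.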