Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Sparse is Enough in Scaling Transformers
Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva
NeurIPS 2021 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The improvement in complexity holds not just asymptotically but yields over 2.6x speedup in wall-clock decoding time already for a model with 800M parameters and 20x improvement for a model with 17B parameters, as shown in Table 1. We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arxiv articles. Terraformer yields results competitive to the state-of-the-art Big Bird-Pegasus without using the Pegasus loss in pre-training (Table 5). |
| Researcher Affiliation | Collaboration | Sebastian Jaszczur (University of Warsaw); Aakanksha Chowdhery (Google Research); Afroz Mohiuddin (Google Research); Łukasz Kaiser (OpenAI); Wojciech Gajewski (Google Research); Henryk Michalewski (Google Research); Jonni Kanerva (Google Research) |
| Pseudocode | No | The paper provides mathematical representations and diagrams (Figure 2), but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax. |
| Open Datasets | Yes | We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arxiv articles. We pretrain Terraformer on C4 (like in all experiments in this paper) and fine-tuned it on the arXiv summarization task. (References [30] for C4 and [6] for arXiv are provided.) |
| Dataset Splits | Yes | The results are obtained by fine-tuning on selected downstream tasks from the GLUE dataset (validation split). |
| Hardware Specification | No | The paper mentions 'unbatched inference on CPUs' but does not provide specific details on the hardware used, such as GPU/CPU models or memory specifications. |
| Software Dependencies | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax. |
| Experiment Setup | Yes | The paper provides specific experimental setup details, including d_model, number of attention heads, attention-sparsity, ff-sparsity (e.g., N=64), d_lowrank=64, Gumbel-softmax temperature 0.1, probability of argmax use (30%), S=16 and F=3 for sparse QKV, SRU dimension (32), and loss sparsity 4. |
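
The setup values listed in the Experiment Setup row can be gathered into a single record for anyone attempting a reproduction. The following is a minimal sketch, not the Trax configuration format: the class name `SparseSetup`, every field name, and the helper `unspecified_fields` are hypothetical and chosen only for illustration. Values that the row names without giving a number (d_model, number of attention heads, attention-sparsity) are left as `None` rather than guessed.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class SparseSetup:
    """Hypothetical container for the values quoted in the table above.

    Field names are illustrative assumptions, not identifiers from Trax.
    """
    d_model: Optional[int] = None            # named in the row, value not given
    n_heads: Optional[int] = None            # named in the row, value not given
    attention_sparsity: Optional[int] = None  # named in the row, value not given
    ff_sparsity: int = 64                    # N = 64, quoted as an example value
    d_lowrank: int = 64                      # d_lowrank = 64
    gumbel_temperature: float = 0.1          # Gumbel-softmax temperature
    argmax_probability: float = 0.30         # probability of argmax use (30%)
    qkv_s: int = 16                          # S = 16 for sparse QKV
    qkv_f: int = 3                           # F = 3 for sparse QKV
    sru_dim: int = 32                        # SRU dimension
    loss_sparsity: int = 4                   # loss sparsity


def unspecified_fields(cfg: SparseSetup) -> list:
    """Return the names of fields that still need a value from the paper."""
    return sorted(name for name, value in asdict(cfg).items() if value is None)


setup = SparseSetup()
print(unspecified_fields(setup))
```

Listing the unset fields explicitly makes it clear which values still have to be read from the paper itself before a run can be configured.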