Sparse is Enough in Scaling Transformers
Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The improvement in complexity holds not just asymptotically but yields over 2.6x speedup in wall-clock decoding time already for a model with 800M parameters and 20x improvement for a model with 17B parameters, as shown in Table 1. We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arXiv articles. Terraformer yields results competitive to the state-of-the-art BigBird-Pegasus without using the Pegasus loss in pre-training (Table 5). |
| Researcher Affiliation | Collaboration | Sebastian Jaszczur, University of Warsaw; Aakanksha Chowdhery, Google Research; Afroz Mohiuddin, Google Research; Łukasz Kaiser, OpenAI; Wojciech Gajewski, Google Research; Henryk Michalewski, Google Research; Jonni Kanerva, Google Research |
| Pseudocode | No | The paper provides mathematical representations and diagrams (Figure 2), but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax. |
| Open Datasets | Yes | We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arXiv articles. We pre-train Terraformer on C4 (as in all experiments in this paper) and fine-tune it on the arXiv summarization task. (References [30] for C4 and [6] for arXiv are provided.) |
| Dataset Splits | Yes | The results are obtained by fine-tuning on selected downstream tasks from the GLUE dataset (validation split). |
| Hardware Specification | No | The paper mentions 'unbatched inference on CPUs' but does not provide specific details on the hardware used, such as GPU/CPU models or memory specifications. |
| Software Dependencies | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax. |
| Experiment Setup | Yes | The paper provides specific experimental setup details, including d_model, the number of attention heads, attention sparsity, FF sparsity (block size N=64), d_lowrank=64, Gumbel-Softmax temperature 0.1, a 30% probability of using argmax during training, S=16 and F=3 for sparse QKV, SRU dimension 32, and loss sparsity 4 (a minimal sketch of the sparse FF controller follows the table). |
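
The Experiment Setup row describes the sparse feed-forward controller in enough detail to sketch it. The snippet below is a minimal, hypothetical reconstruction, not the authors' Trax implementation: it assumes a block size N=64, a low-rank controller of width d_lowrank=64, Gumbel-Softmax sampling at temperature 0.1 with a 30% chance of using the hard argmax during training, and a pure argmax at inference. The function name `sparse_ff_gates`, the layer widths, and the weight initializations are illustrative assumptions.

```python
# Sketch (not the authors' Trax code) of a sparse feed-forward controller:
# the FF layer is split into blocks of N = 64 units, a low-rank controller
# (d_lowrank = 64) scores the units in each block, training uses a
# Gumbel-Softmax at temperature 0.1 (hard argmax 30% of the time), and
# inference keeps exactly one active unit per block.
import jax
import jax.numpy as jnp

D_MODEL, D_FF, N_BLOCK, D_LOWRANK = 512, 2048, 64, 64  # illustrative sizes
TEMPERATURE, ARGMAX_PROB = 0.1, 0.3

def sparse_ff_gates(x, ctrl_down, ctrl_up, key, train=True):
    """Return a (d_ff,)-shaped gate with one (near-)active unit per block."""
    # Low-rank controller: d_model -> d_lowrank -> d_ff, reshaped into blocks.
    logits = (x @ ctrl_down) @ ctrl_up                  # (d_ff,)
    logits = logits.reshape(D_FF // N_BLOCK, N_BLOCK)   # (blocks, N)
    hard = jax.nn.one_hot(jnp.argmax(logits, -1), N_BLOCK)
    if not train:
        return hard.reshape(-1)                         # exactly 1 unit / block
    g_key, b_key = jax.random.split(key)
    gumbel = jax.random.gumbel(g_key, logits.shape)
    soft = jax.nn.softmax((logits + gumbel) / TEMPERATURE, axis=-1)
    # With probability 30%, use the hard argmax even during training.
    use_hard = jax.random.bernoulli(b_key, ARGMAX_PROB)
    return jnp.where(use_hard, hard, soft).reshape(-1)

# Usage sketch: gate the FF activation so only the selected units contribute.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
x = jax.random.normal(k1, (D_MODEL,))
ctrl_down = jax.random.normal(k2, (D_MODEL, D_LOWRANK)) * 0.02
ctrl_up = jax.random.normal(k3, (D_LOWRANK, D_FF)) * 0.02
W_in = jax.random.normal(k4, (D_MODEL, D_FF)) * 0.02
gates = sparse_ff_gates(x, ctrl_down, ctrl_up, key, train=False)
hidden = jax.nn.relu(x @ W_in) * gates                  # sparse FF activation
```

At inference the gate is exactly one-hot per block, so only D_FF / N_BLOCK of the feed-forward units need to be computed per token; this per-block selection is the mechanism behind the decoding-time speedups quoted in the table above.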