Sparse is Enough in Scaling Transformers

Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva

NeurIPS 2021

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The improvement in complexity holds not just asymptotically but yields over 2.6x speedup in wall-clock decoding time already for a model with 800M parameters and a 20x improvement for a model with 17B parameters, as shown in Table 1. We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arXiv articles. Terraformer yields results competitive with the state-of-the-art BigBird-Pegasus without using the Pegasus loss in pre-training (Table 5).
Researcher Affiliation | Collaboration | Sebastian Jaszczur (University of Warsaw), Aakanksha Chowdhery (Google Research), Afroz Mohiuddin (Google Research), Łukasz Kaiser (OpenAI), Wojciech Gajewski (Google Research), Henryk Michalewski (Google Research), Jonni Kanerva (Google Research)
Pseudocode | No | The paper provides mathematical descriptions and diagrams (Figure 2), but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks; a hedged sketch of the sparse feedforward layer is given below the table.
Open Source Code | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax (an access sketch follows the table).
Open Datasets | Yes | We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arXiv articles. We pretrain Terraformer on C4 (as in all experiments in this paper) and fine-tune it on the arXiv summarization task. (References [30] for C4 and [6] for arXiv are provided.)
Dataset Splits | Yes | The results are obtained by fine-tuning on selected downstream tasks from the GLUE dataset (validation split).
Hardware Specification | No | The paper mentions 'unbatched inference on CPUs' but does not provide specific details on the hardware used, such as GPU/CPU models or memory specifications.
Software Dependencies | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax.
Experiment Setup | Yes | The paper provides specific experimental setup details, including d_model, the number of attention heads, attention sparsity, FF sparsity (block size N=64), d_lowrank=64, Gumbel-softmax temperature 0.1, probability of argmax use (30%), S=16 and F=3 for sparse QKV, SRU dimension 32, and loss sparsity 4; several of these values are reused in the sparse feedforward sketch below.
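
Sparse feedforward sketch (referenced in the Pseudocode and Experiment Setup rows). The paper provides no pseudocode, so the following is a hedged NumPy sketch of a block-sparse feedforward layer in the spirit of Scaling Transformers, reusing the values reported above (block size N=64, d_lowrank=64, Gumbel-softmax temperature 0.1, 30% argmax probability). All names (sparse_ffn, W_in, C1, ...) are illustrative assumptions rather than the authors' Trax implementation, and the hard per-block selection shown here simplifies the soft Gumbel-softmax relaxation used during training.

    # Minimal sketch of a block-sparse feedforward layer; all names are illustrative.
    import numpy as np

    rng = np.random.default_rng(0)

    d_model, d_ff, block = 512, 2048, 64           # d_ff is split into d_ff // block blocks of size N=64
    d_lowrank, temperature, argmax_prob = 64, 0.1, 0.3

    # Dense FFN weights plus a low-rank controller (C1 @ C2) that scores units cheaply.
    W_in = rng.standard_normal((d_model, d_ff)) * 0.02
    W_out = rng.standard_normal((d_ff, d_model)) * 0.02
    C1 = rng.standard_normal((d_model, d_lowrank)) * 0.02
    C2 = rng.standard_normal((d_lowrank, d_ff)) * 0.02

    def sparse_ffn(x, train=True):
        # Controller scores, grouped into blocks of size 64 along the d_ff dimension.
        logits = (x @ C1 @ C2).reshape(-1, d_ff // block, block)
        if train and rng.random() > argmax_prob:
            # Gumbel perturbation with temperature 0.1 (hard selection here simplifies
            # the soft Gumbel-softmax relaxation used for training in the paper).
            gumbel = -np.log(-np.log(rng.uniform(size=logits.shape)))
            scores = (logits + gumbel) / temperature
        else:
            scores = logits                        # argmax path (30% of training steps, always at inference)
        mask = (scores == scores.max(axis=-1, keepdims=True)).astype(x.dtype)

        h = np.maximum(x @ W_in, 0.0)              # standard ReLU feedforward activations
        h = h * mask.reshape(-1, d_ff)             # keep only one active unit per block
        return h @ W_out

    y = sparse_ffn(rng.standard_normal((4, d_model)))
    print(y.shape)                                 # (4, 512)

The sketch computes the full matrix products for clarity; the decoding speedup discussed in the paper comes from touching only the selected columns of W_in and rows of W_out at inference.

Code access sketch (referenced in the Open Source Code and Software Dependencies rows). A minimal, hedged example of how one might obtain and browse the released code; the version pin and the attribute search are assumptions to verify against the Trax repository, not steps given in the paper.

    # Minimal sketch, assuming the release is available from PyPI, e.g.:
    #   pip install trax==1.4.0
    from trax import models

    # List model classes exposed by this release; the paper's Terraformer model should
    # appear here if the pinned version ships it (an assumption to verify).
    print([name for name in dir(models) if "Terraformer" in name or "Reformer" in name])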
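
Both sketches are intended only as starting points for reproduction; the authoritative implementation is the Trax code linked in the table above.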