Sparse is Enough in Scaling Transformers
Authors: Sebastian Jaszczur, Aakanksha Chowdhery, Afroz Mohiuddin, Łukasz Kaiser, Wojciech Gajewski, Henryk Michalewski, Jonni Kanerva
NeurIPS 2021
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The improvement in complexity holds not just asymptotically but yields over 2.6x speedup in wall-clock decoding time already for a model with 800M parameters and 20x improvement for a model with 17B parameters, as shown in Table 1. We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arXiv articles. Terraformer yields results competitive to the state-of-the-art BigBird-Pegasus without using the Pegasus loss in pre-training (Table 5). |
| Researcher Affiliation | Collaboration | Sebastian Jaszczur, University of Warsaw; Aakanksha Chowdhery, Google Research; Afroz Mohiuddin, Google Research; Łukasz Kaiser, OpenAI; Wojciech Gajewski, Google Research; Henryk Michalewski, Google Research; Jonni Kanerva, Google Research |
| Pseudocode | No | The paper provides mathematical representations and diagrams (Figure 2), but no explicitly labeled 'Pseudocode' or 'Algorithm' blocks. |
| Open Source Code | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax. |
| Open Datasets | Yes | We pre-train Terraformer on the C4 dataset and fine-tune it on the challenging task of summarizing arXiv articles. We pre-train Terraformer on C4 (as in all experiments in this paper) and fine-tune it on the arXiv summarization task. (References [30] for C4 and [6] for arXiv are provided.) |
| Dataset Splits | Yes | The results are obtained by fine-tuning on selected downstream tasks from the GLUE dataset (validation split). |
| Hardware Specification | No | The paper mentions 'unbatched inference on CPUs' but does not provide specific details on the hardware used, such as GPU/CPU models or memory specifications. |
| Software Dependencies | Yes | The code is open-sourced as part of Trax 1.4.0 at https://github.com/google/trax. |
| Experiment Setup | Yes | The paper provides specific experimental setup details, including d_model, the number of attention heads, attention sparsity, FF sparsity (block size N=64), d_lowrank=64, Gumbel-Softmax temperature 0.1, a 30% probability of using argmax during training, S=16 and F=3 for sparse QKV, SRU dimension 32, and loss sparsity 4 (a minimal sketch of the sparse FF controller follows the table). |
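
The Experiment Setup row describes the sparse feed-forward controller in enough detail to sketch it. The snippet below is a minimal, hypothetical reconstruction, not the authors' Trax implementation: it assumes a block size N=64, a low-rank controller of width d_lowrank=64, Gumbel-Softmax sampling at temperature 0.1 with a 30% chance of using the hard argmax during training, and a pure argmax at inference. The function name `sparse_ff_gates`, the layer widths, and the weight initializations are illustrative assumptions.

```python
# Sketch (not the authors' Trax code) of a sparse feed-forward controller:
# the FF layer is split into blocks of N = 64 units, a low-rank controller
# (d_lowrank = 64) scores the units in each block, training uses a
# Gumbel-Softmax at temperature 0.1 (hard argmax 30% of the time), and
# inference keeps exactly one active unit per block.
import jax
import jax.numpy as jnp

D_MODEL, D_FF, N_BLOCK, D_LOWRANK = 512, 2048, 64, 64  # illustrative sizes
TEMPERATURE, ARGMAX_PROB = 0.1, 0.3

def sparse_ff_gates(x, ctrl_down, ctrl_up, key, train=True):
    """Return a (d_ff,)-shaped gate with one (near-)active unit per block."""
    # Low-rank controller: d_model -> d_lowrank -> d_ff, reshaped into blocks.
    logits = (x @ ctrl_down) @ ctrl_up                  # (d_ff,)
    logits = logits.reshape(D_FF // N_BLOCK, N_BLOCK)   # (blocks, N)
    hard = jax.nn.one_hot(jnp.argmax(logits, -1), N_BLOCK)
    if not train:
        return hard.reshape(-1)                         # exactly 1 unit / block
    g_key, b_key = jax.random.split(key)
    gumbel = jax.random.gumbel(g_key, logits.shape)
    soft = jax.nn.softmax((logits + gumbel) / TEMPERATURE, axis=-1)
    # With probability 30%, use the hard argmax even during training.
    use_hard = jax.random.bernoulli(b_key, ARGMAX_PROB)
    return jnp.where(use_hard, hard, soft).reshape(-1)

# Usage sketch: gate the FF activation so only the selected units contribute.
key = jax.random.PRNGKey(0)
k1, k2, k3, k4 = jax.random.split(key, 4)
x = jax.random.normal(k1, (D_MODEL,))
ctrl_down = jax.random.normal(k2, (D_MODEL, D_LOWRANK)) * 0.02
ctrl_up = jax.random.normal(k3, (D_LOWRANK, D_FF)) * 0.02
W_in = jax.random.normal(k4, (D_MODEL, D_FF)) * 0.02
gates = sparse_ff_gates(x, ctrl_down, ctrl_up, key, train=False)
hidden = jax.nn.relu(x @ W_in) * gates                  # sparse FF activation
```

At inference the gate is exactly one-hot per block, so only D_FF / N_BLOCK of the feed-forward units need to be computed per token; this per-block selection is the mechanism behind the decoding-time speedups quoted in the table above.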