Transformer Quality in Linear Time
Authors: Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc Le
ICML 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments to demonstrate the efficacy of FLASH over a variety of tasks (masked and autoregressive language modeling), datasets (C4, Wiki-40B, PG19) and model scales (110M to 500M). |
| Researcher Affiliation | Collaboration | ¹Cornell University; ²Google Research, Brain Team. |
| Pseudocode | Yes | Figure 2: (c) Pseudocode for Gated Attention Unit. Code 1: Pseudocode for mixed chunk attention. (Hedged illustrative sketches of both appear after this table.) |
| Open Source Code | No | The paper provides pseudocode but does not explicitly state that the source code for the methodology is openly available or provide a link to a repository. |
| Open Datasets | Yes | We pretrain and evaluate all models on the C4 dataset (Raffel et al., 2020). ... For auto-regressive language modeling, we focus on the Wiki-40B (Guo et al., 2020) and PG-19 (Rae et al., 2019) datasets... |
| Dataset Splits | No | The paper refers to 'validation-set results' in figure captions and discusses model training, but it does not explicitly provide the specific percentages or counts for training, validation, and test splits needed for reproduction. |
| Hardware Specification | Yes | Figure 1: TPU-v4 training speedup of FLASH... The training speed of each model (i.e., training latency per step) is measured with 64 TPU-v4 cores... using a single Nvidia Tesla V100 GPU |
| Software Dependencies | No | The paper mentions 'TensorFlow Profiler' and includes TensorFlow in its pseudocode, but it does not specify version numbers for any software dependencies. |
| Experiment Setup | Yes | Appendix B.1. Hyperparameters. ... Table 6: Hyperparameters for MLM pretraining on C4. ... Table 7: Hyperparameters for LM pretraining on Wiki-40B and PG-19. ... These tables specify details such as 'Tokens per batch', 'Batch size', 'Number of steps', 'Warmup steps', 'Peak learning rate', 'Optimizer', 'Weight decay', 'Dropout', and 'Chunk size'. |
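The paper's Figure 2(c) pseudocode is not reproduced in this report, but the Gated Attention Unit is simple enough to sketch from the paper's description: SiLU-activated gate and value projections, a shared low-dimensional representation turned into queries and keys by cheap per-dim scale/offset transforms, and squared-ReLU attention in place of softmax. The NumPy sketch below is assumption-laden, not the authors' code: the pre-norm, residual connection, and relative-position bias of the full unit are omitted, and all parameter names (`Wu`, `Wv`, `Wz`, `Wo`, `gamma_*`, `beta_*`) are our own.

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))  # x * sigmoid(x)

def gau(x, Wu, Wv, Wz, Wo, gamma_q, beta_q, gamma_k, beta_k):
    """Gated Attention Unit forward pass for a single sequence (sketch).

    x:  [T, d] tokens; Wu, Wv: [d, e] gate/value projections;
    Wz: [d, s] shared projection (s is small, e.g. 128);
    Wo: [e, d] output projection; gamma_*, beta_*: [s] per-dim
    scale/offset pairs that derive Q and K from the shared Z.
    """
    T = x.shape[0]
    u = silu(x @ Wu)          # [T, e] gate
    v = silu(x @ Wv)          # [T, e] values
    z = silu(x @ Wz)          # [T, s] shared base for Q and K
    q = z * gamma_q + beta_q  # cheap per-dim transforms ...
    k = z * gamma_k + beta_k  # ... instead of full projections
    a = np.square(np.maximum(q @ k.T / T, 0.0))  # squared-ReLU attention, [T, T]
    return (u * (a @ v)) @ Wo  # gate the attended values, project back to d

# Shape check with random parameters.
rng = np.random.default_rng(0)
T, d, e, s = 16, 64, 128, 32
x = rng.standard_normal((T, d))
Wu, Wv, Wz, Wo = (0.02 * rng.standard_normal(sh)
                  for sh in [(d, e), (d, e), (d, s), (e, d)])
y = gau(x, Wu, Wv, Wz, Wo, np.ones(s), np.zeros(s), np.ones(s), np.zeros(s))
assert y.shape == (T, d)
```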
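The mixed chunk attention of Code 1 can be illustrated in the same spirit: the sequence is split into fixed-size chunks, exact quadratic attention runs inside each chunk, and a running sum of kᵀv over previous chunks supplies the linear-cost global part. The causal sketch below is again ours rather than the paper's Code 1: it takes as given that separate per-dim transforms produce the quadratic and linear Q/K, and it leaves out the gating by `u`, the relative-position bias, and normalization details.

```python
import numpy as np

def mixed_chunk_attention(q_quad, k_quad, q_lin, k_lin, v):
    """Causal mixed chunk attention over pre-chunked inputs (sketch).

    q_quad, k_quad, q_lin, k_lin: [G, C, s] (G chunks of length C);
    v: [G, C, e]. Quadratic attention is exact within a chunk; the
    global part attends to previous chunks through a running [s, e]
    sum of k^T v, so total cost is linear in sequence length.
    """
    G, C, s = q_quad.shape
    e = v.shape[-1]
    out = np.zeros((G, C, e))
    kv_state = np.zeros((s, e))        # accumulated k^T v of past chunks
    causal = np.tril(np.ones((C, C)))  # within-chunk causal mask
    for g in range(G):
        # Local: squared-ReLU attention inside the chunk, causally masked.
        a = np.square(np.maximum(q_quad[g] @ k_quad[g].T / C, 0.0)) * causal
        local = a @ v[g]
        # Global: linear attention against everything before this chunk.
        global_part = q_lin[g] @ kv_state
        out[g] = local + global_part
        kv_state += k_lin[g].T @ v[g]  # fold this chunk into the state
    return out
```

Note the design point this sketch makes concrete: because the global part only ever reads the accumulated `kv_state`, each chunk does O(C²) local work plus O(C·s·e) global work, so total cost grows linearly with the number of chunks rather than quadratically with sequence length.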