Transformer Quality in Linear Time

Authors: Weizhe Hua, Zihang Dai, Hanxiao Liu, Quoc Le

ICML 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct extensive experiments to demonstrate the efficacy of FLASH over a variety of tasks (masked and autoregressive language modeling), datasets (C4, Wiki-40B, PG19) and model scales (110M to 500M).
Researcher Affiliation | Collaboration | 1 Cornell University; 2 Google Research, Brain Team.
Pseudocode | Yes | Figure 2(c): Pseudocode for Gated Attention Unit. Code 1: Pseudocode for mixed chunk attention. (An illustrative sketch of a gated-attention-style unit follows this table.)
Open Source Code | No | The paper provides pseudocode but does not explicitly state that the source code for the methodology is openly available, nor does it provide a link to a repository.
Open Datasets | Yes | We pretrain and evaluate all models on the C4 dataset (Raffel et al., 2020). ... For auto-regressive language modeling, we focus on the Wiki-40B (Guo et al., 2020) and PG-19 (Rae et al., 2019) datasets...
Dataset Splits | No | The paper refers to 'validation-set results' in figure captions and discusses model training, but it does not explicitly provide the percentages or counts of the training, validation, and test splits needed for reproduction.
Hardware Specification | Yes | Figure 1: TPU-v4 training speedup of FLASH... The training speed of each model (i.e., training latency per step) is measured with 64 TPU-v4 cores... using a single Nvidia Tesla V100 GPU.
Software Dependencies | No | The paper mentions the TensorFlow Profiler and includes TensorFlow in its pseudocode, but it does not specify version numbers for any software dependencies.
Experiment Setup | Yes | Appendix B.1, Hyperparameters: Table 6 gives hyperparameters for MLM pretraining on C4, and Table 7 gives hyperparameters for LM pretraining on Wiki-40B and PG-19. These tables specify details such as 'Tokens per batch', 'Batch size', 'Number of steps', 'Warmup steps', 'Peak learning rate', 'Optimizer', 'Weight decay', 'Dropout', and 'Chunk size'. (A hedged configuration skeleton follows this table.)
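
The paper's Figure 2(c) pseudocode is not reproduced on this page, so the snippet below is only a minimal, hedged sketch of a gated-attention-style unit, written to illustrate the kind of computation such pseudocode covers. The function name `gated_attention_unit`, the projection sizes, and the activation choices are assumptions made here, not details taken from the paper.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def gated_attention_unit(x, Wu, Wv, Wz, Wo, gq, bq, gk, bk):
    """Illustrative gated-attention-style unit (a sketch, not the paper's code).

    x:       (n, d) token representations for one sequence
    Wu, Wv:  (d, e) projections for the gating and value branches
    Wz:      (d, s) projection to a small shared representation
    Wo:      (e, d) output projection
    gq, bq, gk, bk: (s,) per-dimension scales/offsets turning z into q and k
    """
    n = x.shape[0]
    u = relu(x @ Wu)              # gating branch
    v = relu(x @ Wv)              # value branch
    z = relu(x @ Wz)              # shared low-dimensional features
    q = z * gq + bq               # cheap per-dimension transforms of z
    k = z * gk + bk
    a = relu(q @ k.T / n) ** 2    # squared-ReLU attention weights
    return (u * (a @ v)) @ Wo     # gate the attended values, project back

# Tiny smoke test with random weights (shapes only; values are arbitrary).
rng = np.random.default_rng(0)
n, d, e, s = 8, 16, 32, 4
out = gated_attention_unit(
    rng.normal(size=(n, d)),
    rng.normal(size=(d, e)), rng.normal(size=(d, e)),
    rng.normal(size=(d, s)), rng.normal(size=(e, d)),
    rng.normal(size=(s,)), rng.normal(size=(s,)),
    rng.normal(size=(s,)), rng.normal(size=(s,)),
)
assert out.shape == (n, d)
```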
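
Likewise, since the hyperparameter values themselves live in the paper's Tables 6 and 7 and are not repeated on this page, the skeleton below only lists the fields a reproduction configuration would need to capture. The field names mirror the list above; every value is left as a placeholder to be filled in from the paper's appendix.

```python
# Skeleton of a pretraining configuration; values must be copied from
# Tables 6 and 7 of the paper (none are filled in here).
mlm_c4_config = {
    "tokens_per_batch": None,    # from Table 6
    "batch_size": None,
    "number_of_steps": None,
    "warmup_steps": None,
    "peak_learning_rate": None,
    "optimizer": None,
    "weight_decay": None,
    "dropout": None,
    "chunk_size": None,          # chunk size used by mixed chunk attention
}
```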