Simplifying Transformer Blocks
Authors: Bobby He, Thomas Hofmann
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update convergence speed and performance of standard transformers, while enjoying 16% faster training throughput, and using 15% fewer parameters. (A minimal sketch of such a simplified block follows the table.) |
| Researcher Affiliation | Academia | Bobby He & Thomas Hofmann, Department of Computer Science, ETH Zurich |
| Pseudocode | No | The paper includes mathematical equations and block diagrams (Figure 1, 10, 11) to describe the models, but it does not contain any blocks explicitly labeled "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Our code for experiments on auto-regressive transformers can be found at https://github.com/bobby-he/simplified_transformers. |
| Open Datasets | Yes | All experiments in this section use an 18-block 768-width causal decoder-only GPT-like model on the CodeParrot dataset, which is sufficiently large that we are in a single epoch regime with minimal generalisation gap (Fig. 2), allowing us to focus on training speed." and "Like Geiping & Goldstein (2023), we train on the Pile dataset (Gao et al., 2020), with a WordPiece tokeniser of vocabulary size 32768, and a sequence length of 128. |
| Dataset Splits | Yes | Figure 2: Loss of training speed in transformers without attention sub-block skip (He et al., 2023), even with Shaped Attention, Eq. (5), and MLP skips (αFF = 1). (Top) Cross-entropy loss vs training step; legend: V-SkipInit (αSA = 0) train/eval, Pre-LN train/eval. The presence of "eval" loss and an "eval" dataset in figures and text (e.g., "eval Cross-entropy Loss") implies the use of a validation set, and the source for preprocessing (the Hugging Face course) typically provides standard splits. |
| Hardware Specification | Yes | All runtime results on CodeParrot were run on a single A5000 GPU." and "Training takes place on a single RTX-2080Ti (with microbatches of size 16), and we use AdamW with weight decay 0.1. |
| Software Dependencies | No | Our implementation, like Geiping & Goldstein (2023), uses automated operator fusion in PyTorch (Sarofeen et al., 2022)." The paper mentions PyTorch but does not provide a specific version number. |
| Experiment Setup | Yes | All experiments in this section use an 18-block 768-width causal decoder-only GPT-like model on the CodeParrot dataset... We use a linear decay learning rate (LR) schedule with AdamW (Loshchilov & Hutter, 2017), with linear warmup for the first 5% of steps. The maximum LR is tuned on training loss, using a logarithmic grid. Additional experimental details are in App. D. (An optimiser and schedule sketch follows the table.) |
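
To make the removals quoted under Research Type concrete, the following is a minimal sketch in the spirit of the paper's simplified parallel block: no attention sub-block skip, no value or output-projection matrices, parallel attention and MLP sub-blocks, and no normalisation layers. The module name `SimplifiedBlock`, the per-head scalars `alpha`/`beta`/`gamma`, and the uniform causal centring matrix are illustrative assumptions rather than the authors' implementation; the linked repository contains the actual code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedBlock(nn.Module):
    """Sketch of a decoder block without attention skip, values/projection, or normalisation."""

    def __init__(self, width: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        assert width % n_heads == 0
        self.n_heads, self.head_dim = n_heads, width // n_heads
        # Only query/key projections remain; values and the output projection
        # are removed (treated as identities).
        self.q_proj = nn.Linear(width, width, bias=False)
        self.k_proj = nn.Linear(width, width, bias=False)
        # Per-head scalars of shaped attention (assumed trainable here).
        self.alpha = nn.Parameter(torch.ones(n_heads))  # identity branch
        self.beta = nn.Parameter(torch.ones(n_heads))   # softmax branch
        self.gamma = nn.Parameter(torch.ones(n_heads))  # centring branch
        self.mlp = nn.Sequential(
            nn.Linear(width, mlp_ratio * width),
            nn.GELU(),
            nn.Linear(mlp_ratio * width, width),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h, hd = self.n_heads, self.head_dim
        q = self.q_proj(x).view(b, t, h, hd).transpose(1, 2)
        k = self.k_proj(x).view(b, t, h, hd).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(hd)
        causal = torch.ones(t, t, device=x.device).triu(1).bool()
        attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        # Shaped attention: alpha * I + beta * softmax(QK^T) - gamma * C,
        # with the centring matrix C approximated by causal uniform averaging.
        eye = torch.eye(t, device=x.device)
        centre = torch.tril(torch.ones(t, t, device=x.device))
        centre = centre / centre.sum(dim=-1, keepdim=True)
        shaped = (self.alpha.view(1, h, 1, 1) * eye
                  + self.beta.view(1, h, 1, 1) * attn
                  - self.gamma.view(1, h, 1, 1) * centre)
        # Identity values: the shaped matrix mixes token representations directly.
        v = x.view(b, t, h, hd).transpose(1, 2)
        attn_out = (shaped @ v).transpose(1, 2).reshape(b, t, d)
        # Parallel sub-blocks, combined without a residual skip or normalisation.
        return attn_out + self.mlp(x)
```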
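
The training setup quoted under Experiment Setup (AdamW with linear warmup for the first 5% of steps followed by linear decay) can likewise be expressed as a short PyTorch sketch. The function name `build_optimizer_and_schedule` and the default values of `max_lr` and `total_steps` are placeholders; the paper tunes the maximum LR on a logarithmic grid.

```python
import torch


def build_optimizer_and_schedule(model, max_lr=1e-3, total_steps=10_000, warmup_frac=0.05):
    """AdamW plus linear warmup for the first `warmup_frac` of steps, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / warmup_steps  # linear warmup: 0 -> max_lr
        # linear decay: max_lr -> 0 at total_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In use, `scheduler.step()` would be called once per optimiser update so the LR reaches zero at the end of training; for the BERT runs the quoted weight decay of 0.1 could be passed via `weight_decay=0.1`.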