Simplifying Transformer Blocks
Authors: Bobby He, Thomas Hofmann
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Combining signal propagation theory and empirical observations, we motivate modifications that allow many block components to be removed with no loss of training speed, including skip connections, projection or value parameters, sequential sub-blocks and normalisation layers. In experiments on both autoregressive decoder-only and BERT encoder-only models, our simplified transformers emulate the per-update convergence speed and performance of standard transformers, while enjoying 16% faster training throughput, and using 15% fewer parameters. (A minimal sketch of such a simplified block follows the table.) |
| Researcher Affiliation | Academia | Bobby He & Thomas Hofmann, Department of Computer Science, ETH Zurich |
| Pseudocode | No | The paper includes mathematical equations and block diagrams (Figure 1, 10, 11) to describe the models, but it does not contain any blocks explicitly labeled "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Our code for experiments on auto-regressive transformers can be found at https://github.com/bobby-he/simplified_transformers. |
| Open Datasets | Yes | All experiments in this section use an 18-block 768-width causal decoder-only GPT-like model on the CodeParrot dataset, which is sufficiently large that we are in a single epoch regime with minimal generalisation gap (Fig. 2), allowing us to focus on training speed." and "Like Geiping & Goldstein (2023), we train on the Pile dataset (Gao et al., 2020), with a WordPiece tokeniser of vocabulary size 32768, and a sequence length of 128. |
| Dataset Splits | Yes | Figure 2: Loss of training speed in transformers without attention sub-block skip (He et al., 2023), even with Shaped Attention, Eq. (5), and MLP skips (αFF = 1). (Top) Cross-entropy loss vs training step; legend: V-SkipInit (αSA = 0) train/eval, Pre-LN train/eval. The presence of "eval" loss and an "eval" dataset in figures and text (e.g., "eval Cross-entropy Loss") implies the use of a validation set, and the source for preprocessing (the Hugging Face course) typically provides standard splits. |
| Hardware Specification | Yes | All runtime results on CodeParrot were run on a single A5000 GPU." and "Training takes place on a single RTX-2080Ti (with microbatches of size 16), and we use AdamW with weight decay 0.1. |
| Software Dependencies | No | Our implementation, like Geiping & Goldstein (2023), uses automated operator fusion in PyTorch (Sarofeen et al., 2022)." The paper mentions PyTorch but does not provide a specific version number. |
| Experiment Setup | Yes | All experiments in this section use an 18-block 768-width causal decoder-only GPT-like model on the CodeParrot dataset... We use a linear decay learning rate (LR) schedule with AdamW (Loshchilov & Hutter, 2017), with linear warmup for the first 5% of steps. The maximum LR is tuned on training loss, using a logarithmic grid. Additional experimental details are in App. D. (An optimiser and schedule sketch follows the table.) |
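
To make the removals quoted under Research Type concrete, the following is a minimal sketch in the spirit of the paper's simplified parallel block: no attention sub-block skip, no value or output-projection matrices, parallel attention and MLP sub-blocks, and no normalisation layers. The module name `SimplifiedBlock`, the per-head scalars `alpha`/`beta`/`gamma`, and the uniform causal centring matrix are illustrative assumptions rather than the authors' implementation; the linked repository contains the actual code.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SimplifiedBlock(nn.Module):
    """Sketch of a decoder block without attention skip, values/projection, or normalisation."""

    def __init__(self, width: int, n_heads: int, mlp_ratio: int = 4):
        super().__init__()
        assert width % n_heads == 0
        self.n_heads, self.head_dim = n_heads, width // n_heads
        # Only query/key projections remain; values and the output projection
        # are removed (treated as identities).
        self.q_proj = nn.Linear(width, width, bias=False)
        self.k_proj = nn.Linear(width, width, bias=False)
        # Per-head scalars of shaped attention (assumed trainable here).
        self.alpha = nn.Parameter(torch.ones(n_heads))  # identity branch
        self.beta = nn.Parameter(torch.ones(n_heads))   # softmax branch
        self.gamma = nn.Parameter(torch.ones(n_heads))  # centring branch
        self.mlp = nn.Sequential(
            nn.Linear(width, mlp_ratio * width),
            nn.GELU(),
            nn.Linear(mlp_ratio * width, width),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, d = x.shape
        h, hd = self.n_heads, self.head_dim
        q = self.q_proj(x).view(b, t, h, hd).transpose(1, 2)
        k = self.k_proj(x).view(b, t, h, hd).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / math.sqrt(hd)
        causal = torch.ones(t, t, device=x.device).triu(1).bool()
        attn = F.softmax(scores.masked_fill(causal, float("-inf")), dim=-1)
        # Shaped attention: alpha * I + beta * softmax(QK^T) - gamma * C,
        # with the centring matrix C approximated by causal uniform averaging.
        eye = torch.eye(t, device=x.device)
        centre = torch.tril(torch.ones(t, t, device=x.device))
        centre = centre / centre.sum(dim=-1, keepdim=True)
        shaped = (self.alpha.view(1, h, 1, 1) * eye
                  + self.beta.view(1, h, 1, 1) * attn
                  - self.gamma.view(1, h, 1, 1) * centre)
        # Identity values: the shaped matrix mixes token representations directly.
        v = x.view(b, t, h, hd).transpose(1, 2)
        attn_out = (shaped @ v).transpose(1, 2).reshape(b, t, d)
        # Parallel sub-blocks, combined without a residual skip or normalisation.
        return attn_out + self.mlp(x)
```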
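
The training setup quoted under Experiment Setup (AdamW with linear warmup for the first 5% of steps followed by linear decay) can likewise be expressed as a short PyTorch sketch. The function name `build_optimizer_and_schedule` and the default values of `max_lr` and `total_steps` are placeholders; the paper tunes the maximum LR on a logarithmic grid.

```python
import torch


def build_optimizer_and_schedule(model, max_lr=1e-3, total_steps=10_000, warmup_frac=0.05):
    """AdamW plus linear warmup for the first `warmup_frac` of steps, then linear decay to zero."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)

    def lr_lambda(step: int) -> float:
        if step < warmup_steps:
            return step / warmup_steps  # linear warmup: 0 -> max_lr
        # linear decay: max_lr -> 0 at total_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```

In use, `scheduler.step()` would be called once per optimiser update so the LR reaches zero at the end of training; for the BERT runs the quoted weight decay of 0.1 could be passed via `weight_decay=0.1`.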