Memorizing Transformers
Authors: Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, Christian Szegedy
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate the effect of adding external memory on five language modeling tasks, all of which involve long-form text: English language books (PG-19), long web articles (C4), technical math papers (arXiv Math), source code (Github), and formal theorems (Isabelle). The results show significant improvements in the perplexity of the model with the addition of external memory. |
| Researcher Affiliation | Industry | {yuhuai,mrabe,delesley,szegedy}@google.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | No | We plan to release our code as open source. |
| Open Datasets | Yes | The datasets for C4 and PG-19 are publicly available. Our additional datasets, Github, Isabelle, and ArXiv Math are derived from publicly available data buckets, which we link in the main part of the paper. |
| Dataset Splits | No | The paper describes how documents are split into subsequences for training steps but does not provide explicit train, validation, or test dataset splits (e.g., as percentages or sample counts) for its experiments. |
| Hardware Specification | Yes | We ran all of our experiments on 32 TPU cores. ... measured on TPUv3. |
| Software Dependencies | No | The paper mentions software such as JAX, Flax, SentencePiece, and the Adafactor optimizer, but does not provide specific version numbers for them. |
| Experiment Setup | Yes | We used a 12-layer decoder-only transformer (with and without Transformer-XL cache) with an embedding size of 1024, 8 attention heads of dimension 128, and an FFN hidden layer of size 4096. For all of our experiments, we used k = 32. Unless specified otherwise, we use the 9th layer as the kNN augmented attention layer. We used the Adafactor optimizer (Shazeer & Stern, 2018). In preliminary experiments, we conducted a hyperparameter search to determine the optimal learning rate among three choices ({3.0, 1.0, 3×10⁻¹}), and found that 1.0 works best. We used a linear warmup schedule for the first 1000 steps, followed by square root decay. We trained the models from scratch for 500K steps on all the datasets... (see the configuration sketch after this table) |
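
For orientation, the hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is not the authors' released code: the class name, field names, and the exact decay formula in `lr_schedule` are illustrative assumptions, while the numeric values are taken from the row above.

```python
# Minimal sketch of the reported setup; names are hypothetical, values are from the paper's description.
from dataclasses import dataclass
import math


@dataclass
class MemorizingTransformerConfig:
    num_layers: int = 12        # 12-layer decoder-only transformer
    embed_dim: int = 1024       # embedding size
    num_heads: int = 8          # attention heads
    head_dim: int = 128         # per-head dimension
    ffn_dim: int = 4096         # FFN hidden layer size
    knn_layer: int = 9          # layer augmented with kNN attention (unless specified otherwise)
    knn_k: int = 32             # k = 32 retrieved (key, value) pairs per query
    train_steps: int = 500_000  # trained from scratch for 500K steps


def lr_schedule(step: int, peak_lr: float = 1.0, warmup_steps: int = 1000) -> float:
    """Linear warmup for the first 1000 steps, followed by square-root decay.

    The exact decay formula is an assumption; the paper only states the schedule
    shape and the peak learning rate selected by the search (1.0).
    """
    step = max(step, 1)
    warmup = peak_lr * step / warmup_steps
    decay = peak_lr * math.sqrt(warmup_steps / step)
    return min(warmup, decay)
```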