Memorizing Transformers

Authors: Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, Christian Szegedy

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effect of adding external memory on five language modeling tasks, all of which involve long-form text: English language books (PG-19), long web articles (C4), technical math papers (arXiv Math), source code (Github), and formal theorems (Isabelle). The results show significant improvements in the perplexity of the model with the addition of external memory.
Researcher Affiliation | Industry | {yuhuai,mrabe,delesley,szegedy}@google.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | We plan to release our code as open source.
Open Datasets | Yes | The datasets for C4 and PG-19 are publicly available. Our additional datasets, Github, Isabelle, and ArXiv Math are derived from publicly available data buckets, which we link in the main part of the paper.
Dataset Splits | No | The paper describes how documents are split into subsequences during training but does not provide explicit train, validation, or test dataset splits (e.g., as percentages or sample counts) for its experiments.
Hardware Specification | Yes | We ran all of our experiments on 32 TPU cores. ... measured on TPUv3.
Software Dependencies | No | The paper mentions software like JAX, Flax, SentencePiece, and the Adafactor optimizer, but does not provide specific version numbers for them.
Experiment Setup | Yes | We used a 12-layer decoder-only transformer (with and without Transformer-XL cache) with an embedding size of 1024, 8 attention heads of dimension 128, and an FFN hidden layer of size 4096. For all of our experiments, we used k = 32. Unless specified otherwise, we use the 9th layer as the kNN-augmented attention layer. We used the Adafactor optimizer (Shazeer & Stern, 2018). In preliminary experiments, we conducted a hyperparameter search to determine the optimal learning rate among three choices ({3.0, 1.0, 3 × 10⁻¹}), and found that 1.0 works best. We used a linear warmup schedule for the first 1000 steps, followed by square root decay. We trained the models from scratch for 500K steps on all the datasets...
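The Experiment Setup row above pins down most of the reported hyperparameters. Below is a minimal sketch of how they could be collected in code, assuming a simple configuration dataclass and a warmup-plus-inverse-square-root learning-rate schedule; the names (TransformerConfig, lr_schedule) and the exact decay formula are illustrative assumptions, not taken from the authors' (unreleased) implementation.

```python
# Illustrative only: hyperparameters as reported in the Experiment Setup row.
# TransformerConfig and lr_schedule are hypothetical names; the decay formula
# is an assumed inverse-square-root variant, since the paper only states
# "linear warmup ... followed by square root decay".
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    num_layers: int = 12      # 12-layer decoder-only transformer
    embed_dim: int = 1024     # embedding size
    num_heads: int = 8        # attention heads
    head_dim: int = 128       # dimension per head
    ffn_dim: int = 4096       # FFN hidden layer size
    memory_layer: int = 9     # layer with kNN-augmented attention
    knn_k: int = 32           # retrieved neighbours per query (k = 32)
    train_steps: int = 500_000


def lr_schedule(step: int, base_lr: float = 1.0, warmup_steps: int = 1000) -> float:
    """Linear warmup for the first `warmup_steps`, then square-root decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * (warmup_steps / (step + 1)) ** 0.5


if __name__ == "__main__":
    cfg = TransformerConfig()
    print(cfg.num_heads * cfg.head_dim)    # 1024, matches the embedding size
    print(round(lr_schedule(999), 3))      # 1.0 at the end of warmup
    print(round(lr_schedule(100_000), 3))  # ~0.1 after 100K steps
```

Under this formulation the peak learning rate of 1.0 is reached at the end of warmup, which matches the best value reported from the search over {3.0, 1.0, 3 × 10⁻¹}.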