Memorizing Transformers

Authors: Yuhuai Wu, Markus Norman Rabe, DeLesley Hutchins, Christian Szegedy

ICLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate the effect of adding external memory on five language modeling tasks, all of which involve long-form text: English language books (PG-19), long web articles (C4), technical math papers (arXiv Math), source code (Github), and formal theorems (Isabelle). The results show significant improvements in the perplexity of the model with the addition of external memory.
Researcher Affiliation | Industry | {yuhuai,mrabe,delesley,szegedy}@google.com
Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks.
Open Source Code | No | We plan to release our code as open source.
Open Datasets | Yes | The datasets for C4 and PG-19 are publicly available. Our additional datasets, Github, Isabelle, and ArXiv Math are derived from publicly available data buckets, which we link in the main part of the paper.
Dataset Splits | No | The paper describes how documents are split into subsequences during training but does not provide explicit train, validation, or test dataset splits (e.g., as percentages or sample counts) for its experiments.
Hardware Specification | Yes | We ran all of our experiments on 32 TPU cores. ... measured on TPUv3.
Software Dependencies | No | The paper mentions software like JAX, Flax, SentencePiece, and the Adafactor optimizer, but does not provide specific version numbers for them.
Experiment Setup | Yes | We used a 12-layer decoder-only transformer (with and without Transformer-XL cache) with an embedding size of 1024, 8 attention heads of dimension 128, and an FFN hidden layer of size 4096. For all of our experiments, we used k = 32. Unless specified otherwise, we use the 9th layer as the kNN-augmented attention layer. We used the Adafactor optimizer (Shazeer & Stern, 2018). In preliminary experiments, we conducted a hyperparameter search to determine the optimal learning rate among three choices ({3.0, 1.0, 3 × 10⁻¹}), and found that 1.0 works best. We used a linear warmup schedule for the first 1000 steps, followed by square root decay. We trained the models from scratch for 500K steps on all the datasets...
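The Experiment Setup row above pins down most of the reported hyperparameters. Below is a minimal sketch of how they could be collected in code, assuming a simple configuration dataclass and a warmup-plus-inverse-square-root learning-rate schedule; the names (TransformerConfig, lr_schedule) and the exact decay formula are illustrative assumptions, not taken from the authors' (unreleased) implementation.

```python
# Illustrative only: hyperparameters as reported in the Experiment Setup row.
# TransformerConfig and lr_schedule are hypothetical names; the decay formula
# is an assumed inverse-square-root variant, since the paper only states
# "linear warmup ... followed by square root decay".
from dataclasses import dataclass


@dataclass
class TransformerConfig:
    num_layers: int = 12      # 12-layer decoder-only transformer
    embed_dim: int = 1024     # embedding size
    num_heads: int = 8        # attention heads
    head_dim: int = 128       # dimension per head
    ffn_dim: int = 4096       # FFN hidden layer size
    memory_layer: int = 9     # layer with kNN-augmented attention
    knn_k: int = 32           # retrieved neighbours per query (k = 32)
    train_steps: int = 500_000


def lr_schedule(step: int, base_lr: float = 1.0, warmup_steps: int = 1000) -> float:
    """Linear warmup for the first `warmup_steps`, then square-root decay."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps
    return base_lr * (warmup_steps / (step + 1)) ** 0.5


if __name__ == "__main__":
    cfg = TransformerConfig()
    print(cfg.num_heads * cfg.head_dim)    # 1024, matches the embedding size
    print(round(lr_schedule(999), 3))      # 1.0 at the end of warmup
    print(round(lr_schedule(100_000), 3))  # ~0.1 after 100K steps
```

Under this formulation the peak learning rate of 1.0 is reached at the end of warmup, which matches the best value reported from the search over {3.0, 1.0, 3 × 10⁻¹}.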