Reformer: The Efficient Transformer
Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on a synthetic task, a text task (enwik8) with sequences of length 64K and an image generation task (imagenet-64 generation) with sequences of length 12K. In both cases we show that Reformer matches the results obtained with full Transformer but runs much faster, especially on the text task, and with orders of magnitude better memory efficiency. |
| Researcher Affiliation | Collaboration | Nikita Kitaev, U.C. Berkeley & Google Research (kitaev@cs.berkeley.edu); Łukasz Kaiser, Google Research; Anselm Levskaya, Google Research ({lukaszkaiser,levskaya}@google.com) |
| Pseudocode | No | The paper does not contain any blocks explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code for training our models is made publicly available.2 (Footnote 2: https://github.com/google/trax/tree/master/trax/models/reformer) |
| Open Datasets | Yes | We ran our experiments on the imagenet64 and enwik8-64K tasks, where the latter is a variant of enwik8 that is chunked into subsequences of 2^16 = 64K tokens. (A chunking sketch follows the table.) |
| Dataset Splits | No | The paper mentions training and evaluation on 'held-out data' and 'test set', but does not provide specific percentages or counts for training, validation, and test splits for the datasets used (enwik8, imagenet64, WMT 2014). |
| Hardware Specification | Yes | Training for all experiments was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores). |
| Software Dependencies | No | The paper mentions using the Adafactor optimizer for training but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | All experiments have dmodel = 1024, dff = 4096, nheads = 8, and a total batch size of 8 sequences. We used the Adafactor optimizer (Shazeer & Stern, 2018) for training these models. We train it for 150K steps in 4 different settings: with full attention, LSH attention with nrounds = 1, nrounds = 2 and nrounds = 4. (A configuration sketch follows the table.) |
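
The "Open Datasets" row describes enwik8-64K as enwik8 chunked into subsequences of 2^16 = 64K tokens. The following is a minimal sketch of that chunking step, not taken from the paper's released code; the file path and function name are illustrative.

```python
# Illustrative sketch: split the raw enwik8 byte stream into fixed-length
# subsequences of 2**16 = 65,536 tokens, dropping any trailing partial chunk.
import numpy as np

SEQ_LEN = 2 ** 16  # 64K tokens per subsequence


def chunk_enwik8(path="enwik8"):
    """Read enwik8 as raw bytes and reshape into (n_chunks, 65536)."""
    data = np.fromfile(path, dtype=np.uint8)
    n_chunks = len(data) // SEQ_LEN
    return data[: n_chunks * SEQ_LEN].reshape(n_chunks, SEQ_LEN)


if __name__ == "__main__":
    chunks = chunk_enwik8()
    print(chunks.shape)  # e.g. (n_chunks, 65536)
```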
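
The "Experiment Setup" row lists the shared hyperparameters. The sketch below restates them as a plain Python dataclass for readability; it is not the authors' actual Trax/gin configuration, and the class and field names are assumptions made for illustration.

```python
# Hypothetical configuration mirroring the reported setup: dmodel=1024, dff=4096,
# nheads=8, total batch size of 8 sequences, Adafactor, 150K training steps.
from dataclasses import dataclass


@dataclass
class ReformerExperimentConfig:
    d_model: int = 1024        # model dimension (dmodel)
    d_ff: int = 4096           # feed-forward dimension (dff)
    n_heads: int = 8           # attention heads
    batch_size: int = 8        # total batch size in sequences
    train_steps: int = 150_000
    optimizer: str = "Adafactor"
    full_attention: bool = False
    n_hash_rounds: int = 1     # LSH attention rounds (nrounds)


# The four settings compared in the paper: full attention, and LSH attention
# with nrounds = 1, 2, and 4.
settings = [ReformerExperimentConfig(full_attention=True)] + [
    ReformerExperimentConfig(n_hash_rounds=r) for r in (1, 2, 4)
]
```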