Reformer: The Efficient Transformer

Authors: Nikita Kitaev, Lukasz Kaiser, Anselm Levskaya

ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We experiment on a synthetic task, a text task (enwik8) with sequences of length 64K and an image generation task (imagenet-64 generation) with sequences of length 12K. In both cases we show that Reformer matches the results obtained with full Transformer but runs much faster, especially on the text task, and with orders of magnitude better memory efficiency.
Researcher Affiliation | Collaboration | Nikita Kitaev, U.C. Berkeley & Google Research, kitaev@cs.berkeley.edu; Łukasz Kaiser, Google Research; Anselm Levskaya, Google Research; {lukaszkaiser,levskaya}@google.com
Pseudocode | No | The paper does not contain any blocks explicitly labeled as "Pseudocode" or "Algorithm".
Open Source Code | Yes | Code for training our models is made publicly available. (Footnote 2: https://github.com/google/trax/tree/master/trax/models/reformer)
Open Datasets | Yes | We ran our experiments on the imagenet64 and enwik8-64K tasks, where the latter is a variant of enwik8 that is chunked into subsequences of 2^16 = 64K tokens.
Dataset Splits | No | The paper mentions training and evaluation on 'held-out data' and 'test set', but does not provide specific percentages or counts for training, validation, and test splits for the datasets used (enwik8, imagenet64, WMT 2014).
Hardware Specification | Yes | Training for all experiments was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores).
Software Dependencies | No | The paper mentions using the Adafactor optimizer for training but does not provide specific version numbers for it or any other software dependencies.
Experiment Setup | Yes | All experiments have d_model = 1024, d_ff = 4096, n_heads = 8, and a total batch size of 8 sequences. We used the Adafactor optimizer (Shazeer & Stern, 2018) for training these models. We train it for 150K steps in 4 different settings: with full attention, LSH attention with n_rounds = 1, n_rounds = 2 and n_rounds = 4. (Hedged sketches of this configuration and of LSH bucketing follow the table.)
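
The quoted hyperparameters can be summarized in a small configuration sketch. This is only a convenience view of the Experiment Setup row above; the dictionary keys are illustrative assumptions, not the gin-config parameter names used in the trax repository.

```python
# Hypothetical summary of the reported setup; key names are illustrative.
reformer_experiment = {
    "d_model": 1024,
    "d_ff": 4096,
    "n_heads": 8,
    "batch_size": 8,           # total batch size, across 8 GPUs or 8 TPU v3 cores
    "optimizer": "Adafactor",  # Shazeer & Stern (2018)
    "train_steps": 150_000,
    # Attention settings compared in the paper:
    "attention_settings": ["full", "lsh_1_round", "lsh_2_rounds", "lsh_4_rounds"],
}
```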
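The "LSH attention with n_rounds" settings refer to the paper's multi-round angular LSH bucketing, h(x) = argmax([xR; -xR]) for a random matrix R. Below is a minimal NumPy sketch of that bucketing step only, under the assumption of shared query/key vectors; the function name and shapes are illustrative and this is not the trax implementation.

```python
import numpy as np

def lsh_buckets(queries, n_buckets, n_rounds, rng=None):
    """Assign each query vector to one bucket per hash round (sketch).

    queries:   [seq_len, d_k] array of shared query/key vectors.
    n_buckets: number of buckets per round (must be even).
    n_rounds:  number of independent hash rounds (n_rounds in the paper).
    Returns:   [n_rounds, seq_len] integer bucket ids in [0, n_buckets).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    seq_len, d_k = queries.shape
    buckets = np.empty((n_rounds, seq_len), dtype=np.int64)
    for r in range(n_rounds):
        # One random projection per round: [d_k, n_buckets // 2].
        R = rng.standard_normal((d_k, n_buckets // 2))
        projected = queries @ R                      # [seq_len, n_buckets // 2]
        # Concatenate xR and -xR and take the argmax to get a bucket id.
        buckets[r] = np.argmax(
            np.concatenate([projected, -projected], axis=-1), axis=-1)
    return buckets

# Usage example with toy sizes; the paper's settings use 1, 2 or 4 rounds.
ids = lsh_buckets(np.random.randn(16, 64), n_buckets=8, n_rounds=2)
print(ids.shape)  # (2, 16)
```

In the full method, positions sharing a bucket are sorted together, attention is restricted to chunks of each bucket, and results from the rounds are combined; that machinery is omitted here.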