Reformer: The Efficient Transformer
Authors: Nikita Kitaev, Łukasz Kaiser, Anselm Levskaya
ICLR 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We experiment on a synthetic task, a text task (enwik8) with sequences of length 64K and an image generation task (imagenet-64 generation) with sequences of length 12K. In both cases we show that Reformer matches the results obtained with full Transformer but runs much faster, especially on the text task, and with orders of magnitude better memory efficiency. |
| Researcher Affiliation | Collaboration | Nikita Kitaev, U.C. Berkeley & Google Research (kitaev@cs.berkeley.edu); Łukasz Kaiser, Google Research; Anselm Levskaya, Google Research ({lukaszkaiser,levskaya}@google.com) |
| Pseudocode | No | The paper does not contain any blocks explicitly labeled as "Pseudocode" or "Algorithm". |
| Open Source Code | Yes | Code for training our models is made publicly available.2 (Footnote 2: https://github.com/google/trax/tree/master/trax/models/reformer) |
| Open Datasets | Yes | We ran our experiments on the imagenet64 and enwik8-64K tasks, where the latter is a variant of enwik8 that is chunked into subsequences of 2^16 = 64K tokens. (A chunking sketch follows the table.) |
| Dataset Splits | No | The paper mentions training and evaluation on 'held-out data' and 'test set', but does not provide specific percentages or counts for training, validation, and test splits for the datasets used (enwik8, imagenet64, WMT 2014). |
| Hardware Specification | Yes | Training for all experiments was parallelized across 8 devices (8 GPUs or 8 TPU v3 cores). |
| Software Dependencies | No | The paper mentions using the Adafactor optimizer for training but does not provide specific version numbers for it or any other software dependencies. |
| Experiment Setup | Yes | All experiments have dmodel = 1024, dff = 4096, nheads = 8, and a total batch size of 8 sequences. We used the Adafactor optimizer (Shazeer & Stern, 2018) for training these models. We train it for 150K steps in 4 different settings: with full attention, LSH attention with nrounds = 1, nrounds = 2 and nrounds = 4. (A configuration sketch follows the table.) |
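
The "Open Datasets" row describes enwik8-64K as enwik8 chunked into subsequences of 2^16 = 64K tokens. The following is a minimal sketch of that chunking step, not taken from the paper's released code; the file path and function name are illustrative.

```python
# Illustrative sketch: split the raw enwik8 byte stream into fixed-length
# subsequences of 2**16 = 65,536 tokens, dropping any trailing partial chunk.
import numpy as np

SEQ_LEN = 2 ** 16  # 64K tokens per subsequence


def chunk_enwik8(path="enwik8"):
    """Read enwik8 as raw bytes and reshape into (n_chunks, 65536)."""
    data = np.fromfile(path, dtype=np.uint8)
    n_chunks = len(data) // SEQ_LEN
    return data[: n_chunks * SEQ_LEN].reshape(n_chunks, SEQ_LEN)


if __name__ == "__main__":
    chunks = chunk_enwik8()
    print(chunks.shape)  # e.g. (n_chunks, 65536)
```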
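
The "Experiment Setup" row lists the shared hyperparameters. The sketch below restates them as a plain Python dataclass for readability; it is not the authors' actual Trax/gin configuration, and the class and field names are assumptions made for illustration.

```python
# Hypothetical configuration mirroring the reported setup: dmodel=1024, dff=4096,
# nheads=8, total batch size of 8 sequences, Adafactor, 150K training steps.
from dataclasses import dataclass


@dataclass
class ReformerExperimentConfig:
    d_model: int = 1024        # model dimension (dmodel)
    d_ff: int = 4096           # feed-forward dimension (dff)
    n_heads: int = 8           # attention heads
    batch_size: int = 8        # total batch size in sequences
    train_steps: int = 150_000
    optimizer: str = "Adafactor"
    full_attention: bool = False
    n_hash_rounds: int = 1     # LSH attention rounds (nrounds)


# The four settings compared in the paper: full attention, and LSH attention
# with nrounds = 1, 2, and 4.
settings = [ReformerExperimentConfig(full_attention=True)] + [
    ReformerExperimentConfig(n_hash_rounds=r) for r in (1, 2, 4)
]
```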