Compressive Transformers for Long-Range Sequence Modelling
Authors: Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, Timothy P. Lillicrap
ICLR 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task. |
| Researcher Affiliation | Collaboration | Authors contributed equally, DeepMind, London, UK. CoMPLEX, Computer Science, University College London, UK. Please direct correspondence to {jwrae, apotapenko}@google.com. |
| Pseudocode | Yes | Algorithm 1 Compressive Transformer... Algorithm 2 Attention-Reconstruction Loss (both sketched below the table) |
| Open Source Code | Yes | A TF implementation can be found in Sonnet: https://github.com/deepmind/sonnet |
| Open Datasets | Yes | We propose a new language modelling benchmark, PG-19, using text from books extracted from Project Gutenberg... PG-19 is available at https://github.com/deepmind/pg19... WikiText-103 (Merity et al., 2016)... Enwik8 taken from the Hutter Prize (Hutter, 2012) |
| Dataset Splits | Yes | PG-19: Train 28,602 books / 1,973,136,207 words; Validation 50 books / 3,007,061 words; Test 100 books / 6,966,499 words... We select the first 90MB for training, 5MB for validation, and the latter 5MB for testing as per convention. |
| Hardware Specification | Yes | The model was trained on 256 TPUv3 cores with a total batch size of 512... We train each network with 32 V100 GPUs, and a batch size of 1 per core (total batch size of 32) using synchronous training. |
| Software Dependencies | No | The paper mentions 'Subword Text Encoder from the tfds package in TensorFlow' and a 'TF implementation... in Sonnet', but it does not specify version numbers for TensorFlow, tfds, or Sonnet, nor any other key software dependencies. |
| Experiment Setup | Yes | We optimised all models with Adam (Kingma and Ba, 2014). We used a learning rate schedule with a linear warmup from 1e-6 to 3e-4 and a cosine decay back down to 1e-6 (see the schedule sketch below the table). For character-based LM we used 4,000 warmup steps with 100,000 decay steps, and for word-based LM we used 16,000 warmup steps with 500,000 decay steps. We clipped the gradients to have a norm of at most 0.1... We train a 36 layer Compressive Transformer with a window size of 512, both memory and compressed memory size of 512, and compression rate C = 2. We compare this to a 36 layer TransformerXL trained with window size 512 and attention window 1024. The model was trained on 256 TPUv3 cores with a total batch size of 512. |
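
The paper's Algorithm 1 describes the core memory bookkeeping: at each step, the oldest activations evicted from the FIFO memory are compressed at rate C and appended to a secondary compressed memory. Below is a minimal NumPy sketch of that update, assuming mean pooling as the compression function (one of several compression functions the paper evaluates); the helper names and shapes are illustrative and not taken from the released Sonnet code.

```python
import numpy as np

def mean_pool_compress(h_old, c):
    """Compress n_s old memories to n_s // c vectors by mean pooling over
    non-overlapping windows of size c (illustrative compression function)."""
    n_s, d = h_old.shape
    assert n_s % c == 0
    return h_old.reshape(n_s // c, c, d).mean(axis=1)

def update_memories(mem, comp_mem, h_new, c=2, n_mem=512, n_cmem=512):
    """One step of the compressive memory update (sketch of Algorithm 1).

    mem      : (n_mem, d)  FIFO memory of past activations
    comp_mem : (n_cmem, d) compressed memory of older activations
    h_new    : (n_s, d)    activations of the current window
    """
    n_s = h_new.shape[0]
    # The oldest n_s activations fall out of the FIFO memory...
    evicted, mem = mem[:n_s], mem[n_s:]
    # ...and are compressed at rate c before joining the compressed memory.
    compressed = mean_pool_compress(evicted, c)
    comp_mem = np.concatenate([comp_mem, compressed], axis=0)[-n_cmem:]
    # The current window's activations are appended to the FIFO memory.
    mem = np.concatenate([mem, h_new], axis=0)[-n_mem:]
    return mem, comp_mem

# Shapes matching the quoted WikiText-103 setup: window 512,
# memory 512, compressed memory 512, compression rate C = 2.
d = 16  # hidden size, illustrative only
mem, comp_mem = np.zeros((512, d)), np.zeros((512, d))
mem, comp_mem = update_memories(mem, comp_mem, np.random.randn(512, d))
print(mem.shape, comp_mem.shape)  # (512, 16) (512, 16)
```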
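Algorithm 2's attention-reconstruction loss trains the compression function to preserve what the model actually attends to: the attention output computed over the old, uncompressed memories is compared against the attention output computed over their compressed counterparts. The single-head NumPy sketch below uses illustrative projection matrices and omits details such as the stop-gradient the paper applies so this loss does not train the main network.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(h, m, W_q, W_k, W_v):
    """Single-head dot-product attention of queries h over memories m."""
    q, k, v = h @ W_q, m @ W_k, m @ W_v
    a = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return a @ v

def attention_reconstruction_loss(h, old_mem, compressed_mem, W_q, W_k, W_v):
    """Sketch of Algorithm 2: penalise the squared difference between
    attention over the raw old memories and attention over the
    compressed memories produced from them."""
    target = attend(h, old_mem, W_q, W_k, W_v)         # attention over raw memories
    approx = attend(h, compressed_mem, W_q, W_k, W_v)  # attention over compressed memories
    return np.mean((target - approx) ** 2)
```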
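The learning-rate schedule quoted in the last row (linear warmup from 1e-6 to 3e-4, then cosine decay back to 1e-6) can be written as a small step-indexed function. This is one plausible reading of the quoted description, not the authors' released code; the defaults follow the character-level LM setup.

```python
import math

def learning_rate(step, lr_min=1e-6, lr_max=3e-4,
                  warmup_steps=4_000, decay_steps=100_000):
    """Linear warmup from lr_min to lr_max, then cosine decay back to lr_min.
    Defaults follow the character-level LM setup; the word-level setup used
    16,000 warmup steps and 500,000 decay steps."""
    if step < warmup_steps:
        # Linear warmup phase.
        return lr_min + (lr_max - lr_min) * step / warmup_steps
    # Cosine decay over decay_steps, held at lr_min afterwards.
    progress = min((step - warmup_steps) / decay_steps, 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

print(learning_rate(0), learning_rate(4_000), learning_rate(104_000))
# ~1e-6, 3e-4, ~1e-6
```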