Compressive Transformers for Long-Range Sequence Modelling

Authors: Jack W. Rae, Anna Potapenko, Siddhant M. Jayakumar, Chloe Hillier, Timothy P. Lillicrap

Venue: ICLR 2020

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We find the Compressive Transformer obtains state-of-the-art language modelling results in the WikiText-103 and Enwik8 benchmarks, achieving 17.1 ppl and 0.97 bpc respectively. We also find it can model high-frequency speech effectively and can be used as a memory mechanism for RL, demonstrated on an object matching task.
Researcher Affiliation | Collaboration | Authors contributed equally, DeepMind, London, UK. CoMPLEX, Computer Science, University College London, UK. Please direct correspondence to {jwrae, apotapenko}@google.com.
Pseudocode | Yes | Algorithm 1 Compressive Transformer... Algorithm 2 Attention-Reconstruction Loss (a hedged sketch of this loss appears after the table).
Open Source Code | Yes | A TF implementation can be found in Sonnet: https://github.com/deepmind/sonnet
Open Datasets | Yes | We propose a new language modelling benchmark, PG-19, using text from books extracted from Project Gutenberg... PG-19 is available at https://github.com/deepmind/pg19... WikiText-103 (Merity et al., 2016)... Enwik8 taken from the Hutter Prize (Hutter, 2012)
Dataset Splits | Yes | PG-19 statistics: # books — 28,602 (train), 50 (valid.), 100 (test); # words — 1,973,136,207 (train), 3,007,061 (valid.), 6,966,499 (test)... We select the first 90MB for training, 5MB for validation, and the latter 5MB for testing as per convention.
Hardware Specification | Yes | The model was trained on 256 TPUv3 cores with a total batch size of 512... We train each network with 32 V100 GPUs, and a batch size of 1 per core (total batch size of 32) using synchronous training.
Software Dependencies | No | The paper mentions 'Subword Text Encoder from the tfds package in TensorFlow' and a 'TF implementation... in Sonnet', but it does not specify version numbers for TensorFlow, tfds, or Sonnet, nor any other key software dependencies.
Experiment Setup | Yes | We optimised all models with Adam (Kingma and Ba, 2014). We used a learning rate schedule with a linear warmup from 1e-6 to 3e-4 and a cosine decay back down to 1e-6. For character-based LM we used 4,000 warmup steps with 100,000 decay steps, and for word-based LM we used 16,000 warmup steps with 500,000 decay steps. We clipped the gradients to have a norm of at most 0.1... We train a 36 layer Compressive Transformer with a window size of 512, both memory and compressed memory size of 512, and compression rate C = 2. We compare this to a 36 layer TransformerXL trained with window size 512 and attention window 1024. The model was trained on 256 TPUv3 cores with a total batch size of 512. (A sketch of this warmup/decay schedule appears after the table.)
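
The Experiment Setup row quotes a linear warmup from 1e-6 to 3e-4 followed by a cosine decay back down to 1e-6. Below is a minimal sketch of that schedule in plain Python, assuming the decay begins immediately after warmup and the rate is held at its floor afterwards; neither of those details (nor the function name learning_rate) comes from the paper.

    import math

    def learning_rate(step, warmup_steps=4_000, decay_steps=100_000,
                      lr_min=1e-6, lr_max=3e-4):
        """Linear warmup from lr_min to lr_max, then cosine decay back to lr_min.
        Defaults use the character-LM settings quoted above; the word-level LM
        used 16,000 warmup steps and 500,000 decay steps."""
        if step < warmup_steps:
            # linear warmup from lr_min to lr_max
            return lr_min + (lr_max - lr_min) * step / warmup_steps
        # cosine decay over decay_steps, then hold at lr_min (assumed behaviour)
        progress = min((step - warmup_steps) / decay_steps, 1.0)
        return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

As a sanity check, learning_rate(0) returns 1e-6, learning_rate(4_000) returns 3e-4, and learning_rate(104_000) returns 1e-6.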
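The Pseudocode row references Algorithm 2, the attention-reconstruction loss, which trains the compression function so that attention over compressed memories reproduces attention over the original old memories. The following is a hedged single-head, single-layer NumPy sketch of that idea; the names content_attention and compress_fn, and the omission of the paper's stop-gradient into the main network, are simplifications for illustration, not the authors' Sonnet/TF implementation.

    import numpy as np

    def content_attention(h, mem, w_q, w_k, w_v):
        """Dot-product attention of hidden states h over memories mem
        (single head for illustration; the paper uses multi-head attention)."""
        q = h @ w_q                       # queries from current hidden states
        k = mem @ w_k                     # keys from (compressed) memories
        v = mem @ w_v                     # values from (compressed) memories
        scores = q @ k.T / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
        weights /= weights.sum(axis=-1, keepdims=True)
        return weights @ v

    def attention_reconstruction_loss(h, old_mem, compress_fn, w_q, w_k, w_v):
        """MSE between attention over the old memories and attention over their
        compressed counterparts; in the paper, gradients from this loss update
        only the compression network (stop-gradient omitted in this sketch)."""
        compressed = compress_fn(old_mem)   # e.g. a rate-C convolution or pooling
        attn_old = content_attention(h, old_mem, w_q, w_k, w_v)
        attn_new = content_attention(h, compressed, w_q, w_k, w_v)
        return np.mean((attn_old - attn_new) ** 2)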