Transformer-VQ: Linear-Time Transformers via Vector Quantization

Authors: Lucas Dax Lingle

ICLR 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our large-scale experiments, Transformer-VQ is shown highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput.
Researcher Affiliation | Industry | Lucas D. Lingle, Independent Researcher, lucasdaxlingle@gmail.com
Pseudocode | Yes | See pseudocode in Appendix E. Code 1: Jax/Flax pseudocode for VQ-Attention.
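The pseudocode referenced above covers the paper's VQ-Attention. As a rough illustration only (not the Appendix E listing), the sketch below shows the step the mechanism is built on: vector-quantizing attention keys against a learned codebook with a straight-through gradient, in Jax. The names and shapes here (quantize_keys, d_k = 8, n_code = 4) are assumptions for the example.

    import jax
    import jax.numpy as jnp

    def quantize_keys(k, codebook):
        # k: [t, d_k] keys; codebook: [n_code, d_k] learned codewords.
        # Squared distance from every key to every codeword: [t, n_code].
        d2 = (jnp.sum(k ** 2, axis=-1, keepdims=True)
              - 2.0 * k @ codebook.T
              + jnp.sum(codebook ** 2, axis=-1))
        codes = jnp.argmin(d2, axis=-1)          # nearest codeword per key
        k_hat = codebook[codes]                  # quantized keys: [t, d_k]
        # Straight-through estimator: forward pass uses k_hat, gradient flows to k.
        return k + jax.lax.stop_gradient(k_hat - k), codes

    k = jax.random.normal(jax.random.PRNGKey(0), (16, 8))        # 16 keys, d_k = 8
    codebook = jax.random.normal(jax.random.PRNGKey(1), (4, 8))  # n_code = 4
    k_hat, codes = quantize_keys(k, codebook)

Because every quantized key is one of a fixed number of codewords, attention over them can be aggregated through the codebook rather than over all positions, which is what gives the paper its linear-time complexity in sequence length.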
Open Source Code | Yes | Code available: https://github.com/transformer-vq/transformer_vq
Open Datasets | Yes | Enwik8 is a byte-level language modeling dataset consisting of 100 million bytes of unprocessed English-language Wikipedia articles (Mahoney, 2011)... Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020).
Dataset Splits | Yes | Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020).
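The quoted 90M/5M/5M convention is straightforward to reproduce. A minimal sketch, assuming the extracted dump is saved locally as "enwik8" (the paper's own input pipeline may differ):

    with open("enwik8", "rb") as f:
        data = f.read()                      # 100 million bytes total

    train = data[:90_000_000]                # 90 million bytes
    valid = data[90_000_000:95_000_000]      # 5 million bytes
    test  = data[95_000_000:]                # 5 million bytes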
Hardware Specification | Yes | For training, we use TPU v3 pod slices (Jouppi et al., 2017). We benchmark on a TPU v3 with 8 cores, using a global batch size of 8 sequences.
Software Dependencies | No | Transformer-VQ is implemented in Jax (Bradbury et al., 2018) and Flax (Heek et al., 2023).
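Since the frameworks are named but no versions are quoted, anyone rerunning the experiments may want to record the environment themselves. A small sketch (not part of the paper's code) that logs the installed versions and the TPU topology quoted above:

    import jax
    import flax

    print("jax:", jax.__version__)
    print("flax:", flax.__version__)
    print("devices:", jax.device_count())    # 8 on the quoted TPU v3 with 8 cores

    global_batch = 8                          # global batch size from the benchmark setup
    print("per-device batch:", global_batch // jax.device_count())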
Experiment Setup | Yes | C.1 HYPERPARAMETERS: Per-dataset hyperparameters are provided below. Table 10: Hyperparameters.
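The values themselves live in Table 10 of the paper and are not reproduced here. As one hedged illustration of how per-dataset hyperparameters might be organized in a Flax-style codebase, an ml_collections config with placeholder fields could look like the sketch below; the field names are illustrative assumptions, not the paper's Table 10 entries.

    from ml_collections import config_dict

    def enwik8_config():
        cfg = config_dict.ConfigDict()
        # Placeholders to be filled in from Table 10 of the paper.
        cfg.sequence_len = config_dict.placeholder(int)
        cfg.global_batch_size = config_dict.placeholder(int)
        cfg.learning_rate = config_dict.placeholder(float)
        cfg.codebook_size = config_dict.placeholder(int)
        return cfg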