Transformer-VQ: Linear-Time Transformers via Vector Quantization
Authors: Lucas Dax Lingle
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our large-scale experiments, Transformer-VQ is shown highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. |
| Researcher Affiliation | Industry | Lucas D. Lingle, Independent Researcher, lucasdaxlingle@gmail.com |
| Pseudocode | Yes | See pseudocode in Appendix E. Code 1: Jax/Flax pseudocode for VQ-Attention. (A simplified sketch of the VQ-Attention kernel appears below the table.) |
| Open Source Code | Yes | Code available: https://github.com/transformer-vq/transformer_vq |
| Open Datasets | Yes | Enwik8 is a byte-level language modeling dataset consisting of 100 million bytes of unprocessed English-language Wikipedia articles (Mahoney, 2011)... Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020). |
| Dataset Splits | Yes | Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020). |
| Hardware Specification | Yes | For training, we use TPU v3 pod slices (Jouppi et al., 2017). We benchmark on a TPU v3 with 8 cores, using a global batch size of 8 sequences. |
| Software Dependencies | No | Transformer-VQ is implemented in Jax (Bradbury et al., 2018) and Flax (Heek et al., 2023). |
| Experiment Setup | Yes | C.1 Hyperparameters: Per-dataset hyperparameters are provided below. Table 10: Hyperparameters. |
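
For readers skimming this summary, the following is a minimal, non-causal sketch of the linear-time idea behind the Appendix E pseudocode: because the keys are vector-quantized against a codebook of S codewords, the softmax numerator factors through the codebook. This is not the author's implementation (see the linked repository and the paper's Code 1 for the causal, block-wise, cached version); the function name `vq_attention_sketch` and the single-head, non-causal setup are illustrative assumptions.

```python
import jax
import jax.numpy as jnp


def vq_attention_sketch(q, k, v, codebook):
    """Single-head, non-causal VQ-Attention sketch.

    q, k, v:   [T, d] queries, keys, values
    codebook:  [S, d] codewords used to quantize the keys
    returns:   [T, d] attention outputs
    """
    d = q.shape[-1]
    # Hard vector quantization: assign each key to its nearest codeword.
    dists = jnp.sum((k[:, None, :] - codebook[None, :, :]) ** 2, axis=-1)  # [T, S]
    delta = jax.nn.one_hot(jnp.argmin(dists, axis=-1), codebook.shape[0])  # [T, S]

    # With quantized keys, exp(Q K_hat^T) = exp(Q C^T) Delta^T, so attention
    # can be computed against the S codewords instead of the T keys.
    scores = jnp.exp(q @ codebook.T / jnp.sqrt(d))                         # [T, S]
    numer = scores @ (delta.T @ v)                                         # [T, d]
    denom = scores @ delta.sum(axis=0)                                     # [T]
    return numer / denom[:, None]


# Toy usage with random inputs.
keys = jax.random.split(jax.random.PRNGKey(0), 4)
T, d, S = 16, 8, 4
q, k, v = (jax.random.normal(keys[i], (T, d)) for i in range(3))
codebook = jax.random.normal(keys[3], (S, d))
print(vq_attention_sketch(q, k, v, codebook).shape)  # (16, 8)
```

The full Transformer-VQ additionally processes the sequence in blocks with a compressed cache and gating to keep the computation causal while remaining linear in sequence length.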