Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Transformer-VQ: Linear-Time Transformers via Vector Quantization
Authors: Lucas Dax Lingle
ICLR 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our large-scale experiments, Transformer-VQ is shown highly competitive in quality, obtaining 0.99 bpb on Enwik8, 26.6 ppl on PG-19, and 3.16 bpb on ImageNet64. In addition, the optimized implementation of Transformer-VQ is over 3x faster than a comparable quadratic-time transformer at sequence length 8k, is over 12x faster at 32k, and can scale to 131k with similar throughput. |
| Researcher Affiliation | Industry | Lucas D. Lingle Independent Researcher EMAIL |
| Pseudocode | Yes | See pseudocode in Appendix E. Code 1: Jax/Flax pseudocode for VQ-Attention. |
| Open Source Code | Yes | Code available: https://github.com/transformer-vq/transformer_vq |
| Open Datasets | Yes | Enwik8 is a byte-level language modeling dataset consisting of 100 million bytes of unprocessed English-language Wikipedia articles (Mahoney, 2011)... Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020). |
| Dataset Splits | Yes | Per convention, it is split into train, validation, and test sets of 90 million, 5 million, and 5 million bytes, respectively (Child et al., 2019; Rae et al., 2020). |
| Hardware Specification | Yes | For training, we use TPU v3 pod slices (Jouppi et al., 2017). We benchmark on a TPU v3 with 8 cores, using a global batch size of 8 sequences. |
| Software Dependencies | No | Transformer-VQ is implemented in Jax (Bradbury et al., 2018) and Flax (Heek et al., 2023). |
| Experiment Setup | Yes | C.1 HYPERPARAMETERS Per-dataset hyperparameters are provided below. Table 10: Hyperparameters. |
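
The Enwik8 split quoted in the Dataset Splits row (90 million, 5 million, and 5 million bytes) can be illustrated with a minimal sketch. The function name `split_enwik8` is hypothetical and is not taken from the paper's repository. The contiguous train, validation, test ordering is an assumption; the quoted text gives only the sizes.

```python
from typing import Tuple


def split_enwik8(
    data: bytes,
    n_train: int = 90_000_000,
    n_valid: int = 5_000_000,
    n_test: int = 5_000_000,
) -> Tuple[bytes, bytes, bytes]:
    """Split a byte string into contiguous train/validation/test segments.

    Hypothetical helper: the default sizes come from the quoted split, while
    the ordering (train first, then validation, then test) is an assumption.
    """
    expected = n_train + n_valid + n_test
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    train = data[:n_train]
    valid = data[n_train:n_train + n_valid]
    test = data[n_train + n_valid:]
    return train, valid, test
```

With the defaults, passing the full 100-million-byte file contents returns the three segments; smaller sizes can be passed explicitly to try the function on a toy buffer.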