Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Authors: Kevin Slagle
NeurIPS 2024 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures. Our experiments are performed on datasets consisting of English books, LaTeX formatted arXiv papers, and open-source code. |
| Researcher Affiliation | Academia | Kevin Slagle Rice University EMAIL |
| Pseudocode | Yes | See Appendix C for pseudocode. Listing 1: PyTorch pseudocode for SpaceByte `def forward(self, tokens, targets=None):` |
| Open Source Code | Yes | Our training code and data reproduction steps can be found at github.com/kjslag/spacebyte. Open source code, job execution scripts, and a jupyter notebook for fully reproducing our results are available at github.com/kjslag/spacebyte. |
| Open Datasets | Yes | Following the MegaByte [7] and MambaByte [6] experiments, we benchmark our models on a diverse range of long-form datasets: PG-19 (English-language books written before 1919) [41]; arXiv (papers from ArXiv written in LaTeX, extracted from the arXiv component of The Pile [42]); and Github (open-source code repositories, extracted from the Github component of The Pile [42]). Each dataset was prepared by downloading it from Hugging Face... |
| Dataset Splits | Yes | The validation (and test) bits-per-byte for SpaceByte-793M+184M on the Stories, arXiv, and Github datasets are 0.877 (0.833), 0.658 (0.663) and 0.397 (0.411), which differ by +5%, −1%, and −3%, respectively. |
| Hardware Specification | Yes | Each model was trained using PyTorch on a single 40GB Nvidia A40 and A100 GPUs with mixed-precision (bfloat16 and float32) training and FlashAttention [57, 58]. |
| Software Dependencies | No | The paper mentions PyTorch and FlashAttention as software used, and provides a command for `spm_train`, but does not provide specific version numbers for these software components in the main text or appendices. Although a `requirements.txt` is mentioned as available in the code justification, this information is not within the paper's text itself. |
| Experiment Setup | Yes | We train all models using a compute-controlled setup, using either 10^18 or 10^19 FLOPs. All models are trained using AdamW [55] with β1 = 0.9, β2 = 0.98, batch size 64, weight decay of 0.01, and gradient clipping [56] with a maximum norm of 1.0. For models trained using 10^18 FLOPs, we train model dimensions D ∈ {384, 512, 768}. For models trained using 10^19 FLOPs, we train model dimensions D ∈ {512, 768, 1024}. |
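
The validation/test gaps quoted in the Dataset Splits row can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the helper name `relative_gap_percent` is hypothetical, and it assumes the gap is measured as validation relative to test, which the quoted excerpt does not state explicitly.

```python
# Validation and test bits-per-byte quoted in the Dataset Splits row
# for SpaceByte-793M+184M, as (validation, test) pairs.
BPB = {
    "Stories": (0.877, 0.833),
    "arXiv": (0.658, 0.663),
    "Github": (0.397, 0.411),
}


def relative_gap_percent(validation: float, test: float) -> float:
    """Relative difference of validation over test, in percent.

    Assumption: the test value is the reference; the quoted excerpt
    does not say which of the two values the percentage is based on.
    """
    return 100.0 * (validation - test) / test


# Signed gap per dataset; a positive value means validation is higher.
gaps = {name: relative_gap_percent(v, t) for name, (v, t) in BPB.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}%")
```

Under this assumption the magnitudes round to the 5%, 1%, and 3% quoted in the row, with validation higher than test only on the Stories dataset. Using the validation value as the reference instead gives the same rounded figures.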