Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al.: K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026.

SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Authors: Kevin Slagle

NeurIPS 2024 | Venue PDF | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures. Our experiments are performed on datasets consisting of English books, LaTeX-formatted arXiv papers, and open-source code.
Researcher Affiliation | Academia | Kevin Slagle, Rice University, EMAIL
Pseudocode | Yes | See Appendix C for pseudocode. Listing 1: PyTorch pseudocode for SpaceByte: def forward(self, tokens, targets=None):
Open Source Code | Yes | Our training code and data reproduction steps can be found at github.com/kjslag/spacebyte. Open source code, job execution scripts, and a Jupyter notebook for fully reproducing our results are available at github.com/kjslag/spacebyte.
Open Datasets | Yes | Following the MegaByte [7] and MambaByte [6] experiments, we benchmark our models on a diverse range of long-form datasets: PG-19 (English-language books written before 1919) [41]; arXiv (papers from arXiv written in LaTeX, extracted from the arXiv component of The Pile [42]); and Github (open-source code repositories, extracted from the Github component of The Pile [42]). Each dataset was prepared by downloading it from Hugging Face...
Dataset Splits | Yes | The validation (and test) bits-per-byte for SpaceByte-793M+184M on the Stories, arXiv, and Github datasets are 0.877 (0.833), 0.658 (0.663) and 0.397 (0.411), which differ by +5%, −1%, and −3%, respectively.
Hardware Specification | Yes | Each model was trained using PyTorch on a single 40GB Nvidia A40 and A100 GPUs with mixed-precision (bfloat16 and float32) training and FlashAttention [57, 58].
Software Dependencies | No | The paper mentions PyTorch and FlashAttention as software used, and provides a command for `spm_train`, but does not provide specific version numbers for these software components in the main text or appendices. Although a `requirements.txt` is mentioned as available in the code justification, this information is not within the paper's text itself.
Experiment Setup | Yes | We train all models using a compute-controlled setup, using either 10^18 or 10^19 FLOPs. All models are trained using AdamW [55] with β1 = 0.9, β2 = 0.98, batch size 64, weight decay of 0.01, and gradient clipping [56] with a maximum norm of 1.0. For models trained using 10^18 FLOPs, we train model dimensions D ∈ {384, 512, 768}. For models trained using 10^19 FLOPs, we train model dimensions D ∈ {512, 768, 1024}.
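
The percentage gaps quoted in the Dataset Splits entry can be checked directly from the validation and test bits-per-byte values. The sketch below is only an illustration: the helper name `relative_gap_percent` is hypothetical, and measuring the gap relative to the test value is an assumption, since the quoted text does not state which value serves as the baseline.

```python
# Validation and test bits-per-byte as quoted in the Dataset Splits entry,
# stored as (validation, test) pairs.
BPB = {
    "Stories": (0.877, 0.833),
    "arXiv": (0.658, 0.663),
    "Github": (0.397, 0.411),
}


def relative_gap_percent(validation: float, test: float) -> float:
    """Signed gap of validation relative to test, in percent.

    Assumption: the test value is the baseline; the source does not say.
    """
    return 100.0 * (validation - test) / test


# Signed percentage gap for each dataset.
gaps = {name: relative_gap_percent(v, t) for name, (v, t) in BPB.items()}

for name, gap in gaps.items():
    print(f"{name}: {gap:+.1f}%")
```

Under this convention the gaps round to +5%, -1%, and -3%, which matches the quoted magnitudes: validation is higher than test on Stories and slightly lower on arXiv and Github.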