SpaceByte: Towards Deleting Tokenization from Large Language Modeling

Authors: Kevin Slagle

NeurIPS 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures. Our experiments are performed on datasets consisting of English books, LaTeX formatted arXiv papers, and open-source code.
Researcher Affiliation | Academia | Kevin Slagle, Rice University, kevin.slagle@rice.edu
Pseudocode | Yes | See Appendix C for pseudocode. Listing 1: PyTorch pseudocode for SpaceByte begins `def forward(self, tokens, targets=None):` (a hedged sketch of such a forward pass appears after the table).
Open Source Code | Yes | Our training code and data reproduction steps can be found at github.com/kjslag/spacebyte. Open source code, job execution scripts, and a Jupyter notebook for fully reproducing our results are available at github.com/kjslag/spacebyte.
Open Datasets | Yes | Following the MegaByte [7] and MambaByte [6] experiments, we benchmark our models on a diverse range of long-form datasets: PG-19 (English-language books written before 1919) [41]; arXiv (papers from arXiv written in LaTeX, extracted from the arXiv component of The Pile [42]); and Github (open-source code repositories, extracted from the Github component of The Pile [42]). Each dataset was prepared by downloading it from Hugging Face... (a hedged loading sketch appears after the table).
Dataset Splits | Yes | The validation (and test) bits-per-byte for SpaceByte-793M+184M on the Stories, arXiv, and Github datasets are 0.877 (0.833), 0.658 (0.663), and 0.397 (0.411), which differ by +5%, 1%, and 3%, respectively. (The bits-per-byte metric is sketched after the table.)
Hardware Specification | Yes | Each model was trained using PyTorch on a single 40GB Nvidia A40 or A100 GPU with mixed-precision (bfloat16 and float32) training and FlashAttention [57, 58]. (A mixed-precision sketch appears after the table.)
Software Dependencies | No | The paper mentions PyTorch and FlashAttention as software used, and provides a command for `spm_train`, but does not give specific version numbers for these components in the main text or appendices. A `requirements.txt` is mentioned in the code-availability justification, but that information is not within the paper's text itself.
Experiment Setup | Yes | We train all models using a compute-controlled setup, using either 10^18 or 10^19 FLOPs. All models are trained using AdamW [55] with β1 = 0.9, β2 = 0.98, batch size 64, weight decay of 0.01, and gradient clipping [56] with a maximum norm of 1.0. For models trained using 10^18 FLOPs, we train model dimensions D ∈ {384, 512, 768}. For models trained using 10^19 FLOPs, we train model dimensions D ∈ {512, 768, 1024}. (An optimizer-setup sketch appears after the table.)
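
The Pseudocode row points to the paper's Appendix C PyTorch listing, which is not reproduced here. The following is a minimal sketch of a SpaceByte-style forward pass, assuming a simplified "previous byte is a space or newline" boundary rule, an illustrative per-example loop, and made-up names (`SpaceByteSketch`, `CausalBlock`) and dimensions; it is not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CausalBlock(nn.Module):
    """One pre-norm transformer block with causal self-attention."""

    def __init__(self, dim, n_heads):
        super().__init__()
        self.n_heads = n_heads
        self.norm1 = nn.LayerNorm(dim)
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):
        b, t, d = x.shape
        q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)
        q, k, v = (z.view(b, t, self.n_heads, -1).transpose(1, 2) for z in (q, k, v))
        a = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        x = x + self.proj(a.transpose(1, 2).reshape(b, t, d))
        return x + self.mlp(self.norm2(x))


class SpaceByteSketch(nn.Module):
    """Byte-level LM: small local blocks on every byte, wide global blocks
    applied only at assumed word-boundary byte positions."""

    def __init__(self, local_dim=384, global_dim=768, n_local=4, n_global=4, n_heads=6):
        super().__init__()
        self.embed = nn.Embedding(256, local_dim)  # raw bytes, no tokenizer
        self.local_in = nn.ModuleList(CausalBlock(local_dim, n_heads) for _ in range(n_local))
        self.up = nn.Linear(local_dim, global_dim)
        self.global_blocks = nn.ModuleList(CausalBlock(global_dim, n_heads) for _ in range(n_global))
        self.down = nn.Linear(global_dim, local_dim)
        self.local_out = nn.ModuleList(CausalBlock(local_dim, n_heads) for _ in range(n_local))
        self.lm_head = nn.Linear(local_dim, 256)

    def forward(self, tokens, targets=None):
        # tokens: (batch, length) int64 byte values in [0, 255]
        x = self.embed(tokens)
        for blk in self.local_in:
            x = blk(x)

        # Assumed, simplified boundary rule: a position is "spacelike" when the
        # previous byte is an ASCII space or newline (the paper's rule is more general).
        prev = F.pad(tokens, (1, 0), value=32)[:, :-1]
        boundary = (prev == 32) | (prev == 10)

        # Run the wide global blocks only on the (much shorter) boundary subsequence.
        # Handled per example for clarity; a real implementation pads and batches this.
        merged = []
        for i in range(x.size(0)):
            idx = boundary[i].nonzero(as_tuple=True)[0]
            xi = x[i]
            if idx.numel() > 0:
                g = self.up(xi[idx]).unsqueeze(0)  # (1, n_boundaries, global_dim)
                for blk in self.global_blocks:
                    g = blk(g)
                xi = xi.clone()
                xi[idx] = xi[idx] + self.down(g[0])
            merged.append(xi)
        x = torch.stack(merged)

        for blk in self.local_out:
            x = blk(x)
        logits = self.lm_head(x)

        loss = None
        if targets is not None:
            loss = F.cross_entropy(logits.reshape(-1, 256), targets.reshape(-1))
        return logits, loss


model = SpaceByteSketch()
byte_ids = torch.randint(0, 256, (2, 128))
logits, loss = model(byte_ids[:, :-1], targets=byte_ids[:, 1:])
```

The point the sketch tries to capture is the one quoted in the Research Type row: within a fixed compute budget, the expensive global blocks run only over the shorter boundary subsequence, while cheap local blocks handle every byte.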
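
The Open Datasets row states that each dataset was downloaded from Hugging Face; the exact reproduction steps live in the linked repository. A hedged sketch with the `datasets` library, in which the Hub identifiers and metadata field names are assumptions:

```python
# Hedged sketch: the Hub dataset ids ("pg19", "monology/pile-uncopyrighted") and
# the "meta"/"pile_set_name" field names are assumptions, not from the paper.
from datasets import load_dataset

pg19 = load_dataset("pg19", split="train", streaming=True)

# The arXiv and Github corpora are described as components of The Pile; one
# plausible route is filtering a Pile mirror on its set name.
pile = load_dataset("monology/pile-uncopyrighted", split="train", streaming=True)
arxiv = pile.filter(lambda ex: ex["meta"]["pile_set_name"] == "ArXiv")
github = pile.filter(lambda ex: ex["meta"]["pile_set_name"] == "Github")

for example in arxiv.take(1):  # .take() works on streaming (iterable) datasets
    print(example["text"][:200])
```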
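
The Dataset Splits row reports bits-per-byte, the standard metric for comparing byte-level and tokenized models: mean cross-entropy converted from nats to bits and amortized over bytes. A small illustrative helper (not taken from the paper):

```python
import math


def bits_per_byte(mean_cross_entropy_nats: float, units_per_byte: float = 1.0) -> float:
    """Convert mean cross-entropy (nats per predicted unit) to bits-per-byte.

    For a byte-level model the predicted unit is a byte, so units_per_byte = 1.
    For a tokenized baseline, pass tokens-per-byte of the evaluation text so the
    per-token loss is amortized over the bytes it covers.
    """
    return mean_cross_entropy_nats * units_per_byte / math.log(2)


# Example: a byte-level loss of about 0.61 nats/byte corresponds to roughly
# 0.88 bits-per-byte, the scale of the numbers quoted above.
print(round(bits_per_byte(0.61), 3))
```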
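
The Hardware Specification row mentions mixed-precision (bfloat16 and float32) training with FlashAttention. A generic PyTorch 2.x sketch of that combination using `torch.autocast` and `scaled_dot_product_attention` (which can dispatch to fused FlashAttention-style kernels on supported GPUs); the shapes, learning rate, and stand-in model are arbitrary assumptions:

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
model = torch.nn.Linear(768, 768).to(device)                 # stand-in for the real model
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)   # lr is assumed

x = torch.randn(8, 256, 768, device=device)                  # arbitrary batch and sequence sizes
# bfloat16 autocast runs most ops in bfloat16 while keeping float32 master
# weights; unlike float16, bfloat16 does not need a GradScaler.
with torch.autocast(device_type=device, dtype=torch.bfloat16, enabled=(device == "cuda")):
    h = model(x)
    q = k = v = h.view(8, 256, 12, 64).transpose(1, 2)       # (batch, heads, time, head_dim)
    # scaled_dot_product_attention may use fused FlashAttention-style kernels on GPU
    h = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    loss = h.float().pow(2).mean()
loss.backward()
optimizer.step()
```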
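
The Experiment Setup row quotes concrete optimization hyperparameters. A sketch of wiring them up in PyTorch: the betas, weight decay, batch size, and clipping norm are the quoted values, while the learning rate and the placeholder model and data are assumptions:

```python
import torch

model = torch.nn.Linear(768, 256)                      # placeholder for the real model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,                                           # assumed; not quoted in this row
    betas=(0.9, 0.98),                                 # beta1, beta2 from the quoted setup
    weight_decay=0.01,                                 # quoted weight decay
)


def training_step(batch_x, batch_y):
    optimizer.zero_grad(set_to_none=True)
    loss = torch.nn.functional.cross_entropy(model(batch_x), batch_y)
    loss.backward()
    # gradient clipping with a maximum norm of 1.0, as quoted
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    return loss.item()


# batch size 64, as in the quoted setup; the data here is random placeholder input
print(training_step(torch.randn(64, 768), torch.randint(0, 256, (64,))))
```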