SpaceByte: Towards Deleting Tokenization from Large Language Modeling
Authors: Kevin Slagle
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show that for a fixed training and inference compute budget, SpaceByte outperforms other byte-level architectures and roughly matches the performance of tokenized Transformer architectures. Our experiments are performed on datasets consisting of English books, LaTeX formatted arXiv papers, and open-source code. |
| Researcher Affiliation | Academia | Kevin Slagle Rice University kevin.slagle@rice.edu |
| Pseudocode | Yes (see the boundary-rule sketch after this table) | See Appendix C for pseudocode. Listing 1: PyTorch pseudocode for SpaceByte: `def forward(self, tokens, targets=None):` |
| Open Source Code | Yes | Our training code and data reproduction steps can be found at github.com/kjslag/spacebyte. Open source code, job execution scripts, and a jupyter notebook for fully reproducing our results are available at github.com/kjslag/spacebyte. |
| Open Datasets | Yes (see the data-loading sketch after this table) | Following the MegaByte [7] and MambaByte [6] experiments, we benchmark our models on a diverse range of long-form datasets: PG-19 (English-language books written before 1919) [41]; arXiv (papers from arXiv written in LaTeX, extracted from the arXiv component of The Pile [42]); and Github (open-source code repositories, extracted from the Github component of The Pile [42]). Each dataset was prepared by downloading it from Hugging Face... |
| Dataset Splits | Yes (see the bits-per-byte note after this table) | The validation (and test) bits-per-byte for SpaceByte-793M+184M on the Stories, arXiv, and Github datasets are 0.877 (0.833), 0.658 (0.663) and 0.397 (0.411), which differ by +5%, −1%, and −3%, respectively. |
| Hardware Specification | Yes (see the mixed-precision sketch after this table) | Each model was trained using PyTorch on a single 40GB Nvidia A40 or A100 GPU with mixed-precision (bfloat16 and float32) training and FlashAttention [57, 58]. |
| Software Dependencies | No | The paper mentions PyTorch and FlashAttention as software used, and provides a command for `spm_train`, but does not give specific version numbers for these components in the main text or appendices. Although a `requirements.txt` is mentioned as being available with the open-source code, that information is not part of the paper's text itself. |
| Experiment Setup | Yes (see the optimizer sketch after this table) | We train all models using a compute-controlled setup, using either 10^18 or 10^19 FLOPs. All models are trained using AdamW [55] with β1 = 0.9, β2 = 0.98, batch size 64, weight decay of 0.01, and gradient clipping [56] with a maximum norm of 1.0. For models trained using 10^18 FLOPs, we train model dimensions D ∈ {384, 512, 768}. For models trained using 10^19 FLOPs, we train model dimensions D ∈ {512, 768, 1024}. |
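
The Pseudocode row above points to Appendix C, where `forward` interleaves byte-level transformer blocks with larger global blocks that run only at selected positions. The sketch below illustrates one plausible selection rule, assuming a byte counts as "spacelike" when it is neither an ASCII letter/digit nor a UTF-8 continuation byte and that global blocks run at non-spacelike bytes that follow a spacelike byte; the names `spacelike` and `global_block_mask` are illustrative, not taken from the paper's code.

```python
import torch

def spacelike(byte_vals: torch.Tensor) -> torch.Tensor:
    # Assumption: a byte is "spacelike" if it is neither an ASCII letter/digit
    # nor a UTF-8 continuation byte (0x80-0xBF).
    is_digit = (byte_vals >= ord("0")) & (byte_vals <= ord("9"))
    is_upper = (byte_vals >= ord("A")) & (byte_vals <= ord("Z"))
    is_lower = (byte_vals >= ord("a")) & (byte_vals <= ord("z"))
    is_continuation = (byte_vals >= 0x80) & (byte_vals <= 0xBF)
    return ~(is_digit | is_upper | is_lower | is_continuation)

def global_block_mask(byte_vals: torch.Tensor) -> torch.Tensor:
    # Positions where the larger global blocks would run: bytes that follow a
    # spacelike byte but are not spacelike themselves (roughly, word starts).
    prev = torch.roll(spacelike(byte_vals), shifts=1, dims=-1)
    prev[..., 0] = True  # treat the first byte of the sequence as a boundary
    return prev & ~spacelike(byte_vals)

tokens = torch.tensor(list(b"SpaceByte skips tokenization."), dtype=torch.long)
print(global_block_mask(tokens))  # True at the 'S' of "SpaceByte", the 's' of "skips", the 't' of "tokenization"
```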
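
The Open Datasets row names PG-19 and two components of The Pile. As a hedged illustration of byte-level data preparation, the snippet below streams PG-19 from the Hugging Face Hub and converts its text to raw UTF-8 bytes; the dataset id `deepmind/pg19` and the `text` field are assumptions about the Hub mirror, and the scripts at github.com/kjslag/spacebyte remain the authoritative preparation steps.

```python
from datasets import load_dataset

# Assumed Hub id and field name; see github.com/kjslag/spacebyte for the
# paper's actual data reproduction steps.
pg19 = load_dataset("deepmind/pg19", split="train", streaming=True)

book = next(iter(pg19))
raw = book["text"].encode("utf-8")  # byte-level models consume raw UTF-8 bytes
tokens = list(raw)                  # each byte (0-255) acts as one "token"
print(len(raw), tokens[:16])
```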
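
The Dataset Splits row reports results as bits-per-byte. For readers checking the numbers, bits-per-byte is the model's total negative log-likelihood expressed in bits, divided by the number of raw bytes; the helper below (its name and arguments are mine, not the paper's) makes the conversion explicit.

```python
import math

def bits_per_byte(mean_nll_nats: float, num_predictions: int, num_bytes: int) -> float:
    # Total negative log-likelihood converted from nats to bits, normalized by
    # the byte count. For a byte-level model num_predictions == num_bytes, so
    # this reduces to mean_nll_nats / ln(2).
    return (mean_nll_nats * num_predictions) / (math.log(2) * num_bytes)

# A byte-level model with a mean cross-entropy of ~0.456 nats per byte gives
# 0.456 / ln(2) ≈ 0.658 bits per byte (the arXiv validation figure above).
print(round(bits_per_byte(0.456, 1_000, 1_000), 3))
```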
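
The Hardware Specification row mentions bfloat16/float32 mixed precision and FlashAttention. The snippet below is a minimal sketch, not the paper's training code: it runs PyTorch's `scaled_dot_product_attention` on bfloat16 tensors, which on supported GPUs can dispatch to a fused FlashAttention kernel and otherwise falls back to the reference implementation; the tensor shapes are arbitrary.

```python
import torch
import torch.nn.functional as F

device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16 if device == "cuda" else torch.float32  # bf16 kernels target the GPU

# (batch, heads, sequence length, head dimension); bfloat16 activations with
# float32 master weights is what "mixed precision" refers to in the quote.
q = torch.randn(1, 8, 1024, 64, device=device, dtype=dtype)
k = torch.randn_like(q)
v = torch.randn_like(q)

# On supported GPUs this dispatches to a fused FlashAttention kernel;
# otherwise PyTorch falls back to a math implementation.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape, out.dtype)
```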
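
The Experiment Setup row fixes the optimizer hyperparameters. A minimal sketch of that configuration in PyTorch follows; the stand-in module, dummy loss, and learning rate are placeholders, since the quoted text does not specify the model internals or learning-rate schedule.

```python
import torch
import torch.nn as nn

model = nn.Linear(768, 768)  # stand-in; the paper trains transformers with D in {384, ..., 1024}

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,             # placeholder; not stated in the quoted setup
    betas=(0.9, 0.98),   # β1 = 0.9, β2 = 0.98
    weight_decay=0.01,
)

# One step with gradient clipping at a maximum norm of 1.0:
loss = model(torch.randn(64, 768)).pow(2).mean()  # batch size 64, dummy loss
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
optimizer.zero_grad()
```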