Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).
Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations
Authors: Brian Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith
NeurIPS 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase. |
| Researcher Affiliation | Collaboration | Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith; University of Washington, Allen Institute for AI, Stanford University |
| Pseudocode | Yes | The pseudocode for the algorithm is in Algorithm 1. Algorithm 1: Random Token Segmentation |
| Open Source Code | Yes | Code is available at https://github.com/Brianzhengca/Tokenizer-Robustness. |
| Open Datasets | Yes | We consider three models, LLAMA-3.1-8B-INSTRUCT [43], OLMO2-7B-INSTRUCT [48], and QWEN-2.5-7B-INSTRUCT [53], which we evaluate on 20 benchmarks shown in Table 1. Please see B.1 for further description of the datasets and evaluation setup. |
| Dataset Splits | No | For the ablation studies in Section 4.2: finetuning the LLAMA-3.2-1B base model on the TULU 3 SFT PERSONAS INSTRUCTION FOLLOWING dataset. Then, we perform the following interventions on the SFT training data and procedure to shed light on the possible source. ... Counting characters: This task asks the model...and contains 1001 samples. ... Acronyms: This task... We construct 3594 5-letter acronyms... Code Description: This task contains 4800 samples... Arithmetic: This task contains 1000 addition and subtraction questions in total. |
| Hardware Specification | Yes | Setup: 8 L40S GPUs |
| Software Dependencies | No | Our finetuning code was forked from allenai/open-instruct. The exact finetune recipe is given below: |
| Experiment Setup | Yes | Gradient Accumulation Steps: 20; Per Device Train Batch Size: 2; Mixed Precision: bf16; Max Seq Length: 4096; Learning Rate: 5e-06; LR Scheduler Type: Linear; Warmup Ratio: 0.03; Weight Decay: 0 |
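
The table above names three non-canonical tokenization schemes: character-level segmentation, right-aligned digit grouping, and random token segmentation. The sketch below illustrates what each produces on plain strings. It is not the authors' code and does not reproduce the paper's Algorithm 1. The function names, the toy vocabulary, the group size of 3, and the choice to sample uniformly among valid segmentations are all assumptions made for illustration; see the linked repository for the actual procedure.

```python
import random


def char_level(text):
    """Character-level segmentation: one piece per character."""
    return list(text)


def right_aligned_digit_groups(number, group_size=3):
    """Split a digit string into groups counted from the right.

    group_size=3 is an assumption for this sketch.
    """
    digits = str(number)
    first = len(digits) % group_size  # length of the short leading group
    groups = [digits[:first]] if first else []
    groups += [digits[i:i + group_size]
               for i in range(first, len(digits), group_size)]
    return groups


def random_segmentation(text, vocab, rng):
    """Sample one segmentation of `text` into pieces drawn from `vocab`.

    Assumption: sampling is uniform over all valid segmentations, which
    may differ from the distribution used in the paper.
    """
    n = len(text)
    max_len = max((len(t) for t in vocab), default=0)

    # ways[i] = number of valid segmentations of the suffix text[i:]
    ways = [0] * (n + 1)
    ways[n] = 1
    for i in range(n - 1, -1, -1):
        for j in range(i + 1, min(n, i + max_len) + 1):
            if text[i:j] in vocab:
                ways[i] += ways[j]
    if ways[0] == 0:
        raise ValueError("text cannot be segmented with this vocabulary")

    # Walk left to right, choosing each piece in proportion to the
    # number of ways the remaining suffix can be completed.
    pieces, i = [], 0
    while i < n:
        ends = [j for j in range(i + 1, min(n, i + max_len) + 1)
                if text[i:j] in vocab and ways[j] > 0]
        j = rng.choices(ends, weights=[ways[e] for e in ends], k=1)[0]
        pieces.append(text[i:j])
        i = j
    return pieces


if __name__ == "__main__":
    toy_vocab = {"t", "o", "k", "e", "n", "to", "ok", "en", "ken", "token"}
    print(char_level("token"))
    print(right_aligned_digit_groups(1234567))
    print(random_segmentation("token", toy_vocab, random.Random(0)))
```

For example, `right_aligned_digit_groups(1234567)` returns `["1", "234", "567"]`, so group boundaries line up with place values. With a real tokenizer, each piece would then be mapped to its token ID. That step is omitted here because it depends on the tokenizer's API.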