Reproducibility Index

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).

Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenizations

Authors: Brian Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

NeurIPS 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	Surprisingly, when evaluated across 20 benchmarks, we find that instruction-tuned models retain up to 93.4% of their original performance when given a randomly sampled tokenization, and 90.8% with character-level tokenization. We see that overall stronger models tend to be more robust, and robustness diminishes as the tokenization departs farther from the canonical form. Motivated by these results, we then identify settings where non-canonical tokenization schemes can improve performance, finding that character-level segmentation improves string manipulation and code understanding tasks by up to +14%, and right-aligned digit grouping enhances large-number arithmetic by +33%. Finally, we investigate the source of this robustness, finding that it arises in the instruction-tuning phase.
Researcher Affiliation	Collaboration	Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith; University of Washington, Allen Institute for AI, Stanford University
Pseudocode	Yes	The pseudocode for the algorithm is in Algorithm 1. Algorithm 1: Random Token Segmentation
Open Source Code	Yes	Code is available at https://github.com/Brianzhengca/Tokenizer-Robustness.
Open Datasets	Yes	We consider three models, LLAMA-3.1-8B-INSTRUCT [43], OLMO2-7B-INSTRUCT [48], and QWEN-2.5-7B-INSTRUCT [53], which we evaluate on 20 benchmarks shown in Table 1. Please see B.1 for further description of the datasets and evaluation setup.
Dataset Splits	No	For the ablation studies in Section 4.2: finetuning the LLAMA-3.2-1B base model on the TULU 3 SFT PERSONAS INSTRUCTION FOLLOWING dataset. Then, we perform the following interventions on the SFT training data and procedure to shed light on the possible source. ... Counting characters: This task asks the model...and contains 1001 samples. ... Acronyms This task...We construct 3594 5-letter acronyms... Code Description This task contains 4800 samples... Arithmetic This task contains 1000 addition and subtraction questions in total.
Hardware Specification	Yes	Setup: 8 L40S GPUs
Software Dependencies	No	Our finetuning code was forked from allenai/open-instruct. The exact finetune recipe is given below:
Experiment Setup	Yes	Gradient Accumulation Steps: 20; Per Device Train Batch Size: 2; Mixed Precision: bf16; Max Seq Length: 4096; Learning Rate: 5e-06; LR Scheduler Type: Linear; Warmup Ratio: 0.03; Weight Decay: 0
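
To make the quoted finetuning recipe easier to read, the following minimal sketch derives the effective batch size and a linear learning-rate schedule from the values in the Hardware Specification and Experiment Setup rows. It rests on two assumptions that the excerpts do not confirm: that all 8 GPUs took part in data-parallel finetuning, and that "Linear" means linear warmup followed by linear decay to zero. The function names `effective_batch_size` and `linear_lr` are hypothetical and are not taken from the authors' code or from allenai/open-instruct.

```python
import math

# Values quoted in the table above.
NUM_GPUS = 8            # "Setup: 8 L40S GPUs" (assumed: all used for finetuning)
PER_DEVICE_BATCH = 2    # Per Device Train Batch Size
GRAD_ACCUM_STEPS = 20   # Gradient Accumulation Steps
PEAK_LR = 5e-6          # Learning Rate
WARMUP_RATIO = 0.03     # Warmup Ratio


def effective_batch_size(num_gpus=NUM_GPUS,
                         per_device=PER_DEVICE_BATCH,
                         accum=GRAD_ACCUM_STEPS):
    """Number of sequences that contribute to one optimizer step."""
    return num_gpus * per_device * accum


def linear_lr(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to peak_lr, then linear decay to zero (assumed shape)."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Warmup phase: scale up proportionally to the step index.
        return peak_lr * step / max(1, warmup_steps)
    # Decay phase: scale down proportionally to the steps remaining.
    remaining = total_steps - warmup_steps
    return peak_lr * max(0.0, (total_steps - step) / max(1, remaining))


if __name__ == "__main__":
    print("effective batch size:", effective_batch_size())
    for s in (0, 15, 30, 500, 1000):
        print(s, linear_lr(s, total_steps=1000))
```

Under these assumptions, each optimizer step aggregates 8 × 2 × 20 = 320 sequences. The total step count of 1000 in the usage lines is purely illustrative, since the excerpts do not state the number of training steps.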