Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," under review, 2026).
Explaining and Mitigating Crosslingual Tokenizer Inequities
Authors: Catherine Arnett, Tyler Chang, Stella Biderman, Benjamin Bergen
NeurIPS 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To understand the cross-linguistic differences that cause these token premiums, we train a suite of approximately 7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. We measure token premiums and test for a relationship between factors such as data similarity (between tokenizer training and evaluation), vocabulary size, and pre-tokenization. |
| Researcher Affiliation | Collaboration | Catherine Arnett¹, Tyler A. Chang², Stella Biderman¹, and Benjamin K. Bergen²; ¹Eleuther AI, ²UC San Diego |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Methods are described in the main text. |
| Open Source Code | Yes | We release all tokenizers and training datasets [on] Hugging Face. Code for training and evaluation is released: https://github.com/catherinearnett/explaining_tokenizer_inequities. |
| Open Datasets | Yes | In total, we train approximately 7000 monolingual tokenizers, which we make available on Hugging Face: https://huggingface.co/datasets/catherinearnett/montok. We use the FLORES-200 dataset (Costa-Jussà et al., 2022), which is a high-quality parallel translation dataset and which includes all 97 languages in our sample. |
| Dataset Splits | No | We use the text datasets from Chang et al. (2024) for all tokenizer training... We calculate token premiums by calculating CTC on parallel text for the corresponding language for each tokenizer. We use the FLORES-200 dataset (Costa-Jussà et al., 2022)... This describes separate training and evaluation datasets, not a train/test/validation split of a single dataset. |
| Hardware Specification | Yes | All tokenizers were trained using the CPUs from one server equipped with an NVIDIA RTX A6000. |
| Software Dependencies | Yes | We use the Hugging Face tokenizers (Hugging Face, 2020) package to train these tokenizers. [The reference entry for Hugging Face (2020) lists the tokenizers package, v0.21.1.] |
| Experiment Setup | Yes | For each language, we train a tokenizer for two tokenizer types (BPE and Unigram; Sennrich et al., 2016; Kudo, 2018) on 300MB of text data, for seven vocabulary sizes ranging from 16384 to 114688. All vocabulary sizes we use are divisible by 128... |
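
The evaluation quoted in the Dataset Splits row (CTC computed on parallel text) can be illustrated with a small sketch. It rests on two assumptions not confirmed by the excerpts above: that CTC stands for corpus token count, and that a token premium is the ratio of one language's count to a reference language's count on the same parallel sentences. The function names and the toy tokenizers are hypothetical and stand in for the trained BPE and Unigram tokenizers; this is not the authors' code.

```python
from typing import Callable, Iterable, List

# Hypothetical type alias: any function mapping a string to a list of tokens.
Tokenize = Callable[[str], List[str]]


def corpus_token_count(tokenize: Tokenize, sentences: Iterable[str]) -> int:
    """Total number of tokens produced over a corpus (assumed meaning of CTC)."""
    return sum(len(tokenize(sentence)) for sentence in sentences)


def token_premium(ctc_language: int, ctc_reference: int) -> float:
    """Ratio of token counts on parallel text (assumed definition of a premium).

    A value above 1.0 means the language needs more tokens than the
    reference to encode the same content.
    """
    if ctc_reference <= 0:
        raise ValueError("reference corpus token count must be positive")
    return ctc_language / ctc_reference


# Toy stand-ins for trained tokenizers (assumptions for illustration only).
whitespace_tokenize: Tokenize = str.split  # one token per whitespace-separated word
character_tokenize: Tokenize = list        # one token per character, spaces included

# Two tiny "parallel" corpora; real evaluation would use FLORES-200 sentences.
reference_sentences = ["the cat sleeps", "it rains"]
language_sentences = ["le chat dort", "il pleut"]

ctc_reference = corpus_token_count(whitespace_tokenize, reference_sentences)
ctc_language = corpus_token_count(character_tokenize, language_sentences)
premium = token_premium(ctc_language, ctc_reference)
```

Because both counts are taken over translations of the same sentences, the ratio reflects how the tokenizer encodes each language rather than differences in content.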