Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in Coakley et al. (K. L. Coakley, T. Snelleman, H. Hoos, and O. E. Gundersen, "The embrace of open science: An analysis of a decade of AI research and 56 800 conference papers," Under Review, 2026).

Explaining and Mitigating Crosslingual Tokenizer Inequities

Authors: Catherine Arnett, Tyler Chang, Stella Biderman, Benjamin Bergen

NeurIPS 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To understand the cross-linguistic differences that cause these token premiums, we train a suite of approximately 7,000 comparable monolingual tokenizers for 97 languages, manipulating tokenization algorithm, vocabulary size, and dataset size. We measure token premiums and test for a relationship between factors such as data similarity (between tokenizer training and evaluation), vocabulary size, and pre-tokenization.
Researcher Affiliation | Collaboration | Catherine Arnett¹, Tyler A. Chang², Stella Biderman¹, and Benjamin K. Bergen²; ¹Eleuther AI, ²UC San Diego
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. Methods are described in the main text.
Open Source Code | Yes | We release all tokenizers and training datasets on Hugging Face. Code for training and evaluation is released: https://github.com/catherinearnett/explaining_tokenizer_inequities.
Open Datasets | Yes | In total, we train approximately 7000 monolingual tokenizers, which we make available on Hugging Face: https://huggingface.co/datasets/catherinearnett/montok. We use the FLORES-200 dataset (Costa-Jussà et al., 2022), which is a high-quality parallel translation dataset and which includes all 97 languages in our sample.
Dataset Splits | No | We use the text datasets from Chang et al. (2024) for all tokenizer training... We calculate token premiums by calculating CTC on parallel text for the corresponding language for each tokenizer. We use the FLORES-200 dataset (Costa-Jussà et al., 2022)... This describes separate training and evaluation datasets, not a train/test/validation split of a single dataset.
Hardware Specification | Yes | All tokenizers were trained using the CPUs from one server equipped with an NVIDIA RTX A6000.
Software Dependencies | Yes | We use the Hugging Face tokenizers (Hugging Face, 2020) package to train these tokenizers. [Reference Hugging Face (2020) lists: tokenizers package, v0.21.1.]
Experiment Setup | Yes | For each language, we train a tokenizer for two tokenizer types (BPE and Unigram; Sennrich et al., 2016; Kudo, 2018) on 300MB of text data, for seven vocabulary sizes ranging from 16384 to 114688. All vocabulary sizes we use are divisible by 128...
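
The experiment setup and the token-premium measurement quoted above can be illustrated with a small Python sketch. It is not the authors' code (that is in the linked repository) and it rests on several assumptions: the seven vocabulary sizes are taken to be evenly spaced between the stated endpoints, CTC is read as a total token count over a corpus, and a token premium is taken to be the ratio of one language's CTC to a reference CTC on the same parallel text. The function names, the language labels, and the whitespace tokenizer are hypothetical placeholders.

```python
from itertools import product

# Endpoints and count come from the quoted setup; even spacing is an assumption.
MIN_VOCAB = 16_384
MAX_VOCAB = 114_688
NUM_SIZES = 7


def vocab_sizes(lo=MIN_VOCAB, hi=MAX_VOCAB, n=NUM_SIZES):
    """Return n evenly spaced vocabulary sizes from lo to hi (assumed spacing)."""
    step, remainder = divmod(hi - lo, n - 1)
    if remainder:
        raise ValueError("endpoints do not allow an even integer spacing")
    return [lo + i * step for i in range(n)]


def experiment_grid(languages, algorithms=("bpe", "unigram")):
    """Enumerate one configuration per (language, algorithm, vocabulary size)."""
    return [
        {"language": lang, "algorithm": algo, "vocab_size": size}
        for lang, algo, size in product(languages, algorithms, vocab_sizes())
    ]


def corpus_token_count(tokenize, sentences):
    """Total number of tokens a tokenizer produces over a list of sentences."""
    return sum(len(tokenize(sentence)) for sentence in sentences)


def token_premium(ctc, reference_ctc):
    """Ratio of a language's token count to a reference count (assumed definition)."""
    if reference_ctc <= 0:
        raise ValueError("reference_ctc must be positive")
    return ctc / reference_ctc


if __name__ == "__main__":
    sizes = vocab_sizes()
    print(sizes)
    print(all(size % 128 == 0 for size in sizes))

    # Hypothetical language labels; the study itself covers 97 languages.
    grid = experiment_grid(["lang_a", "lang_b"])
    print(len(grid))

    # Whitespace splitting stands in for a trained tokenizer here.
    ctc_a = corpus_token_count(str.split, ["a b c", "d e f"])
    ctc_b = corpus_token_count(str.split, ["a b", "c d"])
    print(token_premium(ctc_a, ctc_b))
```

Under the even-spacing assumption every size is a multiple of 16384, so the divisibility-by-128 constraint holds automatically. The grid covers only the tokenizer type and vocabulary size dimensions named in the setup row; the dataset-size and pre-tokenization manipulations mentioned elsewhere in the table are left out.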