Language Model Tokenizers Introduce Unfairness Between Languages

Authors: Aleksandar Petrov, Emanuele La Malfa, Philip Torr, Adel Bibi

Venue: NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. This induces unfair treatment for some language communities in regard to the cost of accessing commercial language services, the processing time and latency, as well as the amount of content that can be provided as context to the models. Therefore, we make the case that we should train future language models using multilingually fair subword tokenizers.
Researcher Affiliation | Academia | Aleksandar Petrov, Emanuele La Malfa, Philip H.S. Torr, Adel Bibi; University of Oxford; aleks@robots.ox.ac.uk
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'An interactive table of all the languages and tokenizers is also available on the project website.' This refers to data and results, not the source code for the methodology described in the paper.
Open Datasets | Yes | To this end, we use the FLORES-200 parallel corpus, comprising of the same 2000 sentences taken from Wikipedia and human-translated to 200 different languages (Guzmán et al., 2019; Goyal et al., 2021; Costa-jussà et al., 2022).
Dataset Splits | No | The paper uses the FLORES-200 parallel corpus but does not specify any training, validation, or test splits for its experiments.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU or CPU models, used to run its experiments.
Software Dependencies | No | The paper discusses various models and tokenizers but does not provide specific software dependency versions (e.g., library names with version numbers) for its experimental setup.
Experiment Setup | No | The paper describes how it evaluates existing tokenizers but does not provide specific experimental setup details such as hyperparameter values or training configurations for its own analysis.
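
Since the paper releases no code, the kind of measurement behind its headline claim can only be sketched. The following is a minimal sketch assuming the Hugging Face transformers library; "xlm-roberta-base" stands in for any multilingual subword tokenizer, and the parallel sentences are illustrative translations rather than FLORES-200 text.

```python
# Count how many subword tokens the same sentence needs in different languages.
# Assumptions (not the paper's own code): `transformers` is installed and
# "xlm-roberta-base" is used as a representative multilingual tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Illustrative parallel sentences, keyed by FLORES-200-style language codes.
parallel = {
    "eng_Latn": "Language models are used all over the world.",
    "deu_Latn": "Sprachmodelle werden auf der ganzen Welt verwendet.",
    "ell_Grek": "Τα γλωσσικά μοντέλα χρησιμοποιούνται σε όλο τον κόσμο.",
}

# Token counts per language, excluding special tokens so only content is measured.
token_counts = {
    lang: len(tokenizer.encode(text, add_special_tokens=False))
    for lang, text in parallel.items()
}

# Length ratio relative to English for the same content.
for lang, count in token_counts.items():
    ratio = count / token_counts["eng_Latn"]
    print(f"{lang}: {count} tokens ({ratio:.2f}x English)")
```

The paper computes this kind of length ratio over the full FLORES-200 corpus and a range of tokenizers, which is where the reported differences of up to 15 times between languages come from.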
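
The abstract's point about character- and byte-level models can be illustrated without any learned vocabulary at all: scripts outside basic Latin need more UTF-8 bytes per character, so byte-level sequence lengths also diverge. Again a sketch with illustrative sentences, not FLORES-200 data; UTF-8 byte length is used here only as a proxy for a byte-level model's input length.

```python
# Compare raw UTF-8 byte counts for the same illustrative parallel sentences.
parallel = {
    "eng_Latn": "Language models are used all over the world.",
    "ell_Grek": "Τα γλωσσικά μοντέλα χρησιμοποιούνται σε όλο τον κόσμο.",
}

byte_counts = {lang: len(text.encode("utf-8")) for lang, text in parallel.items()}

for lang, count in byte_counts.items():
    ratio = count / byte_counts["eng_Latn"]
    print(f"{lang}: {count} UTF-8 bytes ({ratio:.2f}x English)")
```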