Language Model Tokenizers Introduce Unfairness Between Languages
Authors: Aleksandar Petrov, Emanuele La Malfa, Philip Torr, Adel Bibi
NeurIPS 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. This induces unfair treatment for some language communities in regard to the cost of accessing commercial language services, the processing time and latency, as well as the amount of content that can be provided as context to the models. Therefore, we make the case that we should train future language models using multilingually fair subword tokenizers. |
| Researcher Affiliation | Academia | Aleksandar Petrov, Emanuele La Malfa, Philip H.S. Torr, Adel Bibi; University of Oxford; aleks@robots.ox.ac.uk |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper states: 'An interactive table of all the languages and tokenizers is also available on the project website.' This refers to data/results, not the source code for the methodology described in the paper. |
| Open Datasets | Yes | To this end, we use the FLORES-200 parallel corpus, comprising of the same 2000 sentences taken from Wikipedia and human-translated to 200 different languages (Guzmán et al., 2019; Goyal et al., 2021; Costa-jussà et al., 2022). |
| Dataset Splits | No | The paper uses the FLORES-200 parallel corpus but does not specify any training, validation, or test splits for its experiments. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU or CPU models used for running its experiments. |
| Software Dependencies | No | The paper discusses various models and tokenizers but does not provide specific software dependency versions (e.g., library names with version numbers) for its experimental setup. |
| Experiment Setup | No | The paper describes how it evaluates existing tokenizers but does not provide specific experimental setup details, such as hyperparameter values or training configurations, for its own analysis. |
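
The disparities quoted in the Research Type row, which the paper measures on the FLORES-200 parallel corpus noted in the Open Datasets row, can be sketched in a few lines of code. The snippet below is not the authors' released pipeline: it assumes the Hugging Face `transformers` library, uses `xlm-roberta-base` as a stand-in multilingual tokenizer, and substitutes illustrative placeholder sentences for the FLORES-200 data. It counts subword tokens and UTF-8 bytes for the same sentence in several languages and reports each count relative to English.

```python
# Minimal sketch of the kind of measurement the paper reports: for the same sentence
# rendered in several languages, count subword tokens and UTF-8 bytes, then express
# each count relative to English.
# Assumptions: `transformers` is installed; "xlm-roberta-base" stands in for the
# multilingual tokenizers the paper surveys; the sentences below are illustrative
# placeholders, not FLORES-200 data.
from transformers import AutoTokenizer

parallel_sentences = {
    "eng_Latn": "The weather was unusually warm for this time of year.",
    "deu_Latn": "Das Wetter war für diese Jahreszeit ungewöhnlich warm.",
    "ell_Grek": "Ο καιρός ήταν ασυνήθιστα ζεστός για αυτή την εποχή του χρόνου.",
    "hin_Deva": "इस मौसम के लिए तापमान असामान्य रूप से गर्म था।",
}

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")


def measure(text: str) -> tuple[int, int]:
    """Return (subword token count, UTF-8 byte count) for one sentence."""
    tokens = tokenizer.encode(text, add_special_tokens=False)
    return len(tokens), len(text.encode("utf-8"))


baseline_tokens, baseline_bytes = measure(parallel_sentences["eng_Latn"])
for lang, text in parallel_sentences.items():
    n_tokens, n_bytes = measure(text)
    print(
        f"{lang}: {n_tokens} tokens ({n_tokens / baseline_tokens:.2f}x English), "
        f"{n_bytes} bytes ({n_bytes / baseline_bytes:.2f}x English)"
    )
```

Running the same loop over the full FLORES-200 corpus and the set of tokenizers surveyed in the paper is, in outline, how the reported subword and byte-level disparities would be measured; the exact figures depend on the languages and tokenizers chosen.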