Language Model Tokenizers Introduce Unfairness Between Languages

Authors: Aleksandar Petrov, Emanuele La Malfa, Philip Torr, Adel Bibi

Venue: NeurIPS 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this paper, we show how disparity in the treatment of different languages arises at the tokenization stage, well before a model is even invoked. The same text translated into different languages can have drastically different tokenization lengths, with differences up to 15 times in some cases. These disparities persist even for tokenizers that are intentionally trained for multilingual support. Character-level and byte-level models also exhibit over 4 times the difference in the encoding length for some language pairs. This induces unfair treatment for some language communities in regard to the cost of accessing commercial language services, the processing time and latency, as well as the amount of content that can be provided as context to the models. Therefore, we make the case that we should train future language models using multilingually fair subword tokenizers.
Researcher Affiliation | Academia | Aleksandar Petrov, Emanuele La Malfa, Philip H.S. Torr, Adel Bibi; University of Oxford; aleks@robots.ox.ac.uk
Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks.
Open Source Code | No | The paper states: 'An interactive table of all the languages and tokenizers is also available on the project website.' This refers to data and results, not the source code for the methodology described in the paper.
Open Datasets | Yes | To this end, we use the FLORES-200 parallel corpus, comprising of the same 2000 sentences taken from Wikipedia and human-translated to 200 different languages (Guzmán et al., 2019; Goyal et al., 2021; Costa-jussà et al., 2022).
Dataset Splits | No | The paper uses the FLORES-200 parallel corpus but does not specify any training, validation, or test splits for its experiments.
Hardware Specification | No | The paper does not provide any specific hardware details, such as GPU or CPU models, used to run its experiments.
Software Dependencies | No | The paper discusses various models and tokenizers but does not provide specific software dependency versions (e.g., library names with version numbers) for its experimental setup.
Experiment Setup | No | The paper describes how it evaluates existing tokenizers but does not provide specific experimental setup details such as hyperparameter values or training configurations for its own analysis.
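
Since the paper releases no code, the kind of measurement behind its headline claim can only be sketched. The following is a minimal sketch assuming the Hugging Face transformers library; "xlm-roberta-base" stands in for any multilingual subword tokenizer, and the parallel sentences are illustrative translations rather than FLORES-200 text.

```python
# Count how many subword tokens the same sentence needs in different languages.
# Assumptions (not the paper's own code): `transformers` is installed and
# "xlm-roberta-base" is used as a representative multilingual tokenizer.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")

# Illustrative parallel sentences, keyed by FLORES-200-style language codes.
parallel = {
    "eng_Latn": "Language models are used all over the world.",
    "deu_Latn": "Sprachmodelle werden auf der ganzen Welt verwendet.",
    "ell_Grek": "Τα γλωσσικά μοντέλα χρησιμοποιούνται σε όλο τον κόσμο.",
}

# Token counts per language, excluding special tokens so only content is measured.
token_counts = {
    lang: len(tokenizer.encode(text, add_special_tokens=False))
    for lang, text in parallel.items()
}

# Length ratio relative to English for the same content.
for lang, count in token_counts.items():
    ratio = count / token_counts["eng_Latn"]
    print(f"{lang}: {count} tokens ({ratio:.2f}x English)")
```

The paper computes this kind of length ratio over the full FLORES-200 corpus and a range of tokenizers, which is where the reported differences of up to 15 times between languages come from.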
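
The abstract's point about character- and byte-level models can be illustrated without any learned vocabulary at all: scripts outside basic Latin need more UTF-8 bytes per character, so byte-level sequence lengths also diverge. Again a sketch with illustrative sentences, not FLORES-200 data; UTF-8 byte length is used here only as a proxy for a byte-level model's input length.

```python
# Compare raw UTF-8 byte counts for the same illustrative parallel sentences.
parallel = {
    "eng_Latn": "Language models are used all over the world.",
    "ell_Grek": "Τα γλωσσικά μοντέλα χρησιμοποιούνται σε όλο τον κόσμο.",
}

byte_counts = {lang: len(text.encode("utf-8")) for lang, text in parallel.items()}

for lang, count in byte_counts.items():
    ratio = count / byte_counts["eng_Latn"]
    print(f"{lang}: {count} UTF-8 bytes ({ratio:.2f}x English)")
```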