XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation

Authors: Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. (A sketch of this zero-shot transfer protocol follows the table.)
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 DeepMind, 3 Google Research.
Pseudocode | No | The paper describes methods and processes in narrative text and tables, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code used for downloading data and training baseline models is available at https://github.com/google-research/xtreme.
Open Datasets | Yes | We use POS tagging data from the Universal Dependencies v2.5 (Nivre et al., 2018) treebanks (...) For NER, we use the Wikiann (Pan et al., 2017) dataset. (...) We use the Tatoeba dataset (Artetxe & Schwenk, 2019).
Dataset Splits | Yes | Table 1. Characteristics of the datasets in XTREME for the zero-shot transfer setting. For tasks that have training and dev sets in other languages, we only report the English numbers. We report the number of test examples per target language and the nature of the test sets (whether they are translations of English data or independently annotated). (...) The quoted XNLI row reads: 392,702 train / 2,490 dev / 5,010 test examples, translated test sets, 15 languages, NLI task, accuracy metric, miscellaneous domain. (A split-checking sketch follows the table.)
Hardware Specification | No | The paper states that tasks should be 'trainable on a single GPU' as a design principle, but it does not provide specific hardware details (e.g., GPU models, CPU types, or memory) used for its experiments.
Software Dependencies | No | The paper mentions various models (mBERT, XLM, XLM-R) but does not provide specific version numbers for software dependencies or libraries used for implementation (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | All hyper-parameter tuning is done on English validation data. We encourage authors evaluating on XTREME to do the same. (...) We report hyperparameters in the appendix. (A sketch of this tuning rule follows the table.)
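
The zero-shot cross-lingual transfer setting quoted in the Research Type row is simple to state in code: fine-tune a pretrained multilingual encoder on English task data only, then evaluate it directly on target-language test sets. The sketch below is illustrative only; `MultilingualEncoder` and its methods are stand-ins for mBERT/XLM/XLM-R, not code from the paper or its repository.

```python
# Minimal sketch of the zero-shot cross-lingual transfer setting XTREME
# evaluates. `MultilingualEncoder` is an illustrative stand-in, not the
# paper's implementation.

class MultilingualEncoder:
    def __init__(self):
        self.finetuned_on = None

    def finetune(self, language: str) -> "MultilingualEncoder":
        # In the benchmark, task-specific fine-tuning uses English data only.
        self.finetuned_on = language
        return self

    def evaluate(self, language: str) -> str:
        # The English-tuned model is applied to target-language test sets
        # with no target-language supervision.
        return f"eval on '{language}' (fine-tuned on '{self.finetuned_on}')"

model = MultilingualEncoder().finetune("en")
for lang in ["de", "sw", "zh"]:  # three of the benchmark's 40 languages
    print(model.evaluate(lang))
```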
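
To sanity-check the split sizes quoted in the Dataset Splits row, one option is the Hugging Face `datasets` hub, which mirrors the benchmark under the name "xtreme". This is a hedged sketch: the config name "XNLI" and the exact splits exposed by the mirror are assumptions to verify locally; the official download scripts live in the google-research/xtreme repository linked above.

```python
# Hedged sketch: inspect XTREME's XNLI splits via the Hugging Face hub.
# The dataset name "xtreme" and config "XNLI" are assumptions about the
# hub mirror, not part of the paper.

from datasets import load_dataset

xnli = load_dataset("xtreme", "XNLI")
for split, data in xnli.items():
    # Table 1 reports 392,702 train / 2,490 dev / 5,010 test examples per
    # target language; the mirror may concatenate all 15 languages, in
    # which case the counts here will be a multiple of those figures.
    print(split, len(data))
```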
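
The tuning rule quoted in the Experiment Setup row is easy to encode: model selection consults only the English dev set, and the winning setting is reused unchanged for every target language. A minimal sketch follows, with a hypothetical `train_and_score` routine standing in for a real fine-tuning run; neither the function nor the dummy values come from the paper.

```python
# Sketch of XTREME's model-selection rule: hyper-parameters are chosen on
# English validation data only, then reused for all target languages.
# `train_and_score` is a hypothetical stand-in for an actual training run.

def select_hyperparameters(train_and_score, grid):
    # Score every candidate on the English dev set; no target-language
    # dev data is consulted at any point.
    return max(grid, key=lambda hp: train_and_score(hp, dev_lang="en"))

# Dummy usage: pretend these are English dev accuracies per learning rate.
dev_acc = {1e-5: 0.81, 3e-5: 0.84, 5e-5: 0.79}
best_lr = select_hyperparameters(lambda hp, dev_lang: dev_acc[hp],
                                 grid=[1e-5, 3e-5, 5e-5])
print(best_lr)  # 3e-05, applied unchanged to every target language
```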