XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation
Authors: Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson
ICML 2020
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. |
| Researcher Affiliation | Collaboration | ¹Carnegie Mellon University, ²DeepMind, ³Google Research. |
| Pseudocode | No | The paper describes methods and processes in narrative text and tables, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | The codes used for downloading data and training baseline models are available at https://github.com/google-research/xtreme. |
| Open Datasets | Yes | We use POS tagging data from the Universal Dependencies v2.5 (Nivre et al., 2018) treebanks (...) For NER, we use the Wikiann (Pan et al., 2017) dataset. (...) We use the Tatoeba dataset (Artetxe & Schwenk, 2019) |
| Dataset Splits | Yes | Table 1. Characteristics of the datasets in XTREME for the zero-shot transfer setting. For tasks that have training and dev sets in other languages, we only report the English numbers. We report the number of test examples per target language and the nature of the test sets (whether they are translations of English data or independently annotated). (...) XNLI: 392,702 train / 2,490 dev / 5,010 test examples; test sets are translations; 15 languages; NLI task; accuracy metric; misc. domain. A minimal split-inspection sketch follows this table. |
| Hardware Specification | No | The paper states that tasks should be 'trainable on a single GPU' as a design principle, but it does not provide specific hardware models (e.g., GPU models, CPU types, or memory) used for conducting its experiments. |
| Software Dependencies | No | The paper mentions various models (mBERT, XLM, XLM-R) but does not provide specific version numbers for software dependencies or libraries used for implementation (e.g., Python, TensorFlow, PyTorch versions). |
| Experiment Setup | Yes | All hyper-parameter tuning is done on English validation data. We encourage authors evaluating on XTREME to do the same. (...) We report hyperparameters in the appendix. |
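As a quick illustration of the dataset-split row above, the sketch below loads one XTREME task and counts the examples per split. This is not the authors' released pipeline (that lives at https://github.com/google-research/xtreme); it assumes the community-hosted `xtreme` dataset on the Hugging Face Hub and its `XNLI` configuration, and it simply reports whichever splits that configuration exposes.

```python
# Hypothetical sketch, not the authors' code: inspect the splits of one
# XTREME task via the Hugging Face `datasets` library. Assumes the
# community-hosted "xtreme" dataset with an "XNLI" configuration.
from datasets import load_dataset

xnli = load_dataset("xtreme", "XNLI")  # returns a DatasetDict of splits

# Table 1 of the paper reports 392,702 train / 2,490 dev / 5,010 test
# examples for XNLI; the counts printed here are whatever the Hub
# packaging actually provides, so treat any mismatch as a packaging
# difference rather than an error in the paper.
for split_name, split in xnli.items():
    print(f"{split_name}: {len(split)} examples")

# Peek at one example to see the column layout (field names depend on
# how the Hub version packages the data).
first_split = next(iter(xnli.values()))
print(first_split[0])
```

Consistent with the experiment-setup row, the paper tunes all hyper-parameters on the English validation data only, so any tuning loop built on top of these splits should select models on the English dev split and report zero-shot results on the other languages' test sets.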