XTREME: A Massively Multilingual Multi-task Benchmark for Evaluating Cross-lingual Generalisation

Authors: Junjie Hu, Sebastian Ruder, Aditya Siddhant, Graham Neubig, Orhan Firat, Melvin Johnson

ICML 2020 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To this end, we introduce the Cross-lingual TRansfer Evaluation of Multilingual Encoders (XTREME) benchmark, a multi-task benchmark for evaluating the cross-lingual generalization capabilities of multilingual representations across 40 languages and 9 tasks. We demonstrate that while models tested on English reach human performance on many tasks, there is still a sizable gap in the performance of cross-lingually transferred models, particularly on syntactic and sentence retrieval tasks. (A sketch of this zero-shot transfer protocol follows the table.)
Researcher Affiliation | Collaboration | 1 Carnegie Mellon University, 2 DeepMind, 3 Google Research.
Pseudocode | No | The paper describes methods and processes in narrative text and tables, but it does not include any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | The code used for downloading data and training baseline models is available at https://github.com/google-research/xtreme.
Open Datasets | Yes | We use POS tagging data from the Universal Dependencies v2.5 (Nivre et al., 2018) treebanks (...) For NER, we use the Wikiann (Pan et al., 2017) dataset. (...) We use the Tatoeba dataset (Artetxe & Schwenk, 2019).
Dataset Splits | Yes | Table 1. Characteristics of the datasets in XTREME for the zero-shot transfer setting. For tasks that have training and dev sets in other languages, we only report the English numbers. We report the number of test examples per target language and the nature of the test sets (whether they are translations of English data or independently annotated). (...) The quoted XNLI row reads: 392,702 train / 2,490 dev / 5,010 test examples, translated test sets, 15 languages, NLI task, accuracy metric, miscellaneous domain. (A split-checking sketch follows the table.)
Hardware Specification | No | The paper states that tasks should be 'trainable on a single GPU' as a design principle, but it does not provide specific hardware details (e.g., GPU models, CPU types, or memory) used for its experiments.
Software Dependencies | No | The paper mentions various models (mBERT, XLM, XLM-R) but does not provide specific version numbers for software dependencies or libraries used for implementation (e.g., Python, TensorFlow, or PyTorch versions).
Experiment Setup | Yes | All hyper-parameter tuning is done on English validation data. We encourage authors evaluating on XTREME to do the same. (...) We report hyperparameters in the appendix. (A sketch of this tuning rule follows the table.)
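
The zero-shot cross-lingual transfer setting quoted in the Research Type row is simple to state in code: fine-tune a pretrained multilingual encoder on English task data only, then evaluate it directly on target-language test sets. The sketch below is illustrative only; `MultilingualEncoder` and its methods are stand-ins for mBERT/XLM/XLM-R, not code from the paper or its repository.

```python
# Minimal sketch of the zero-shot cross-lingual transfer setting XTREME
# evaluates. `MultilingualEncoder` is an illustrative stand-in, not the
# paper's implementation.

class MultilingualEncoder:
    def __init__(self):
        self.finetuned_on = None

    def finetune(self, language: str) -> "MultilingualEncoder":
        # In the benchmark, task-specific fine-tuning uses English data only.
        self.finetuned_on = language
        return self

    def evaluate(self, language: str) -> str:
        # The English-tuned model is applied to target-language test sets
        # with no target-language supervision.
        return f"eval on '{language}' (fine-tuned on '{self.finetuned_on}')"

model = MultilingualEncoder().finetune("en")
for lang in ["de", "sw", "zh"]:  # three of the benchmark's 40 languages
    print(model.evaluate(lang))
```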
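
To sanity-check the split sizes quoted in the Dataset Splits row, one option is the Hugging Face `datasets` hub, which mirrors the benchmark under the name "xtreme". This is a hedged sketch: the config name "XNLI" and the exact splits exposed by the mirror are assumptions to verify locally; the official download scripts live in the google-research/xtreme repository linked above.

```python
# Hedged sketch: inspect XTREME's XNLI splits via the Hugging Face hub.
# The dataset name "xtreme" and config "XNLI" are assumptions about the
# hub mirror, not part of the paper.

from datasets import load_dataset

xnli = load_dataset("xtreme", "XNLI")
for split, data in xnli.items():
    # Table 1 reports 392,702 train / 2,490 dev / 5,010 test examples per
    # target language; the mirror may concatenate all 15 languages, in
    # which case the counts here will be a multiple of those figures.
    print(split, len(data))
```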
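
The tuning rule quoted in the Experiment Setup row is easy to encode: model selection consults only the English dev set, and the winning setting is reused unchanged for every target language. A minimal sketch follows, with a hypothetical `train_and_score` routine standing in for a real fine-tuning run; neither the function nor the dummy values come from the paper.

```python
# Sketch of XTREME's model-selection rule: hyper-parameters are chosen on
# English validation data only, then reused for all target languages.
# `train_and_score` is a hypothetical stand-in for an actual training run.

def select_hyperparameters(train_and_score, grid):
    # Score every candidate on the English dev set; no target-language
    # dev data is consulted at any point.
    return max(grid, key=lambda hp: train_and_score(hp, dev_lang="en"))

# Dummy usage: pretend these are English dev accuracies per learning rate.
dev_acc = {1e-5: 0.81, 3e-5: 0.84, 5e-5: 0.79}
best_lr = select_hyperparameters(lambda hp, dev_lang: dev_acc[hp],
                                 grid=[1e-5, 3e-5, 5e-5])
print(best_lr)  # 3e-05, applied unchanged to every target language
```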