IGLUE: A Benchmark for Transfer Learning across Modalities, Tasks, and Languages

Authors: Emanuele Bugliarello, Fangyu Liu, Jonas Pfeiffer, Siva Reddy, Desmond Elliott, Edoardo Maria Ponti, Ivan Vulić

ICML 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our benchmark enables the evaluation of multilingual multimodal models for transfer learning, not only in a zero-shot setting, but also in newly defined few-shot learning setups. Based on the evaluation of the available state-of-the-art models, we find that translate-test transfer is superior to zero-shot transfer and that few-shot learning is hard to harness for many tasks.
Researcher Affiliation | Academia | 1 University of Copenhagen, 2 Mila Quebec Artificial Intelligence Institute, 3 University of Cambridge, 4 TU Darmstadt, 5 New York University, 6 McGill University.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | We provide data and code for the evaluation of multilingual V&L models at https://iglue-benchmark.github.io/.
Open Datasets | Yes | We create IGLUE by collating current research threads in this area and extending them with two datasets for cross-lingual visual entailment (XVNLI) and image-text retrieval (xFlickr&CO), for an even more comprehensive coverage of tasks. ... We combine the text-only dataset SNLI (Bowman et al., 2015), with its multimodal (Xie et al., 2019) and cross-lingual (Agić & Schluter, 2018) counterparts. ... The NLVR2 data (Suhr et al., 2019) in English are used for training.
Dataset Splits | Yes | We provide new train, development, and test splits such that the test split consists of images covered by the underlying cross-lingual text-only dataset. ... For each run, we select the checkpoint that achieves the largest validation performance for evaluation on the test sets.
Hardware Specification | Yes | We train all models on a single NVIDIA V100 (16GB) GPU card.
Software Dependencies | No | The paper mentions software like 'PyTorch (Paszke et al., 2019)' and 'VOLTA (Bugliarello et al., 2021)', but does not provide specific version numbers for these software components to enable reproducible setup.
Experiment Setup | Yes | We fine-tune all models using the AdamW optimiser (Loshchilov & Hutter, 2019) relying on the same hyper-parameters as in the controlled setup of Bugliarello et al. (2021). For few-shot experiments, we instead search three learning rates {1e-5, 5e-5, 1e-4} and train for 20 epochs for each dataset-language-shots triplet. ... Visual Entailment. ... We use a learning rate of 2e-5 and max token length of 80. ... Visual QA. ... batch size of 256 for 5 epochs. We use a learning rate of 4e-5 and max token length of 40.
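
For concreteness, the following is a minimal PyTorch sketch of the fine-tuning recipe quoted in the Experiment Setup and Dataset Splits rows above: AdamW optimisation, a learning-rate sweep over {1e-5, 5e-5, 1e-4} with 20 epochs for few-shot runs, and selection of the checkpoint with the best validation score. This is not the authors' VOLTA-based implementation; the model interface, data loaders, and the evaluate_fn metric are hypothetical placeholders.

import copy
from torch.optim import AdamW

FEW_SHOT_LRS = [1e-5, 5e-5, 1e-4]  # learning-rate grid quoted in the paper
NUM_EPOCHS = 20                    # few-shot fine-tuning epochs quoted in the paper

def finetune(model, train_loader, val_loader, evaluate_fn,
             lr, epochs=NUM_EPOCHS, device="cuda"):
    """Fine-tune with AdamW and keep the checkpoint with the best validation score."""
    model.to(device)
    optimizer = AdamW(model.parameters(), lr=lr)
    best_score, best_state = float("-inf"), None
    for _ in range(epochs):
        model.train()
        for batch in train_loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # assumes the model returns an object with a .loss field
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        score = evaluate_fn(model, val_loader)  # task metric on the development split
        if score > best_score:
            best_score = score
            best_state = copy.deepcopy(model.state_dict())
    model.load_state_dict(best_state)  # restore the best checkpoint before test evaluation
    return model, best_score

# Few-shot sweep: run each learning rate and keep the best validation run, e.g.
# runs = [finetune(build_model(), train_dl, val_dl, accuracy, lr) for lr in FEW_SHOT_LRS]
# best_model, best_score = max(runs, key=lambda r: r[1])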