Machine-Created Universal Language for Cross-Lingual Transfer

Authors: Yaobo Liang, Quanzhi Zhu, Junhe Zhao, Nan Duan

AAAI 2024 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments demonstrate that translating into MUL yields improved performance compared to multilingual pre-training, and our analysis indicates that MUL possesses strong interpretability. We conduct experiments on XNLI, NER, MLQA, and Tatoeba using MUL as input.
Researcher Affiliation | Industry | Microsoft Research Asia {yaobo.liang, v-quanzhizhu, v-junhezhao, nanduan}@microsoft.com
Pseudocode | No | The paper describes the steps of its method but does not include structured pseudocode or algorithm blocks.
Open Source Code | Yes | The code is at: https://github.com/microsoft/Unicoder/tree/master/MCUL.
Open Datasets | Yes | In the first stage, we pre-train the encoder with a multilingual MLM objective on 15 languages of XNLI. The pre-training corpus is CC-Net (Wenzek et al. 2020). In the second stage, we train our model on bilingual data OPUS-100 (Zhang et al. 2020).
Dataset Splits | No | The paper mentions using 'English training data' for some tasks and for pre-training/fine-tuning, but it does not explicitly provide train/validation/test split details (e.g., percentages or sample counts per split) for the datasets used.
Hardware Specification | No | The paper mentions 'Limited by resources' and 'GPU memory usage' but does not specify particular GPU or CPU models, or other hardware components used for the experiments.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x).
Experiment Setup | Yes | Limited by resources, we pre-train the model for 500K steps with a batch size of 8192, which is less than XLM-R Base. The hyper-parameters in pre-training and finetuning are the same as those of natural language. The size of the universal vocabulary K is set to 60K.
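To make the reported setup easier to scan, the following is a minimal configuration sketch based only on the Open Datasets and Experiment Setup rows above. The dataclass structure, field names, and the XLM-R Base backbone note are illustrative assumptions, not the paper's released code; the numeric values (500K steps, batch size 8192, 60K universal vocabulary) and corpus names (CC-Net, OPUS-100) are taken from the quoted excerpts.

```python
# Sketch of the two-stage training setup described in the table above.
# Values marked "from paper" appear in the quoted excerpts; everything else
# (field names, backbone choice, this structure) is an assumption for illustration.
from dataclasses import dataclass, field
from typing import List


@dataclass
class MULTrainingSketch:
    # Stage 1: multilingual MLM pre-training (from paper)
    pretraining_corpus: str = "CC-Net"        # from paper (Wenzek et al. 2020)
    num_languages: int = 15                   # from paper: 15 XNLI languages
    objective: str = "multilingual MLM"       # from paper
    # Stage 2: training on bilingual data (from paper)
    bilingual_corpus: str = "OPUS-100"        # from paper (Zhang et al. 2020)
    # Reported hyper-parameters (from paper)
    pretraining_steps: int = 500_000          # from paper: 500K steps
    batch_size: int = 8192                    # from paper
    universal_vocab_size: int = 60_000        # from paper: K = 60K
    # Assumed detail: the paper compares its budget against XLM-R Base
    backbone: str = "XLM-R Base"              # assumption, not stated as the backbone
    # Downstream evaluation tasks (from paper)
    downstream_tasks: List[str] = field(
        default_factory=lambda: ["XNLI", "NER", "MLQA", "Tatoeba"]
    )


if __name__ == "__main__":
    print(MULTrainingSketch())
```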