Machine-Created Universal Language for Cross-Lingual Transfer
Authors: Yaobo Liang, Quanzhi Zhu, Junhe Zhao, Nan Duan
AAAI 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that translating into MUL yields improved performance compared to multilingual pre-training, and our analysis indicates that MUL possesses strong interpretability. We conduct experiments on XNLI, NER, MLQA, and Tatoeba using MUL as input. |
| Researcher Affiliation | Industry | Microsoft Research Asia {yaobo.liang, v-quanzhizhu, v-junhezhao, nanduan}@microsoft.com |
| Pseudocode | No | The paper describes the steps of its method but does not include structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The code is at: https://github.com/microsoft/Unicoder/tree/master/MCUL. |
| Open Datasets | Yes | In the first stage, we pre-train the encoder with a multilingual MLM objective on 15 languages of XNLI. The pre-training corpus is CC-Net (Wenzek et al. 2020). In the second stage, we train our model on bilingual data OPUS-100 (Zhang et al. 2020). |
| Dataset Splits | No | The paper mentions using 'English training data' for some tasks and describes pre-training and fine-tuning, but it does not explicitly provide train/validation/test split details (e.g., percentages or sample counts per split) for the datasets used. |
| Hardware Specification | No | The paper mentions 'Limited by resources' and 'GPU memory usage' but does not specify any particular GPU or CPU models, or other hardware components used for experiments. |
| Software Dependencies | No | The paper does not provide specific software dependencies with version numbers (e.g., Python 3.x, PyTorch 1.x, CUDA 11.x). |
| Experiment Setup | Yes | Limited by resources, we pre-train the model for 500K steps with a batch size of 8192, which is less than XLM-R Base. The hyper-parameters in pre-training and finetuning are the same as those of natural language. The size of the universal vocabulary K is set to 60K. |
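
For reference, the reported experiment setup can be summarized in a minimal configuration sketch. This is not the authors' code: the dataclass and its field names are hypothetical, and only the numeric values (500K pre-training steps, batch size 8192, universal vocabulary size K = 60K) come from the paper.

```python
from dataclasses import dataclass


@dataclass
class MULPretrainConfig:
    """Hypothetical container for the hyper-parameters reported in the paper.

    Only the three numeric values are taken from the text; the class and
    field names are illustrative assumptions, not the authors' API.
    """
    total_steps: int = 500_000          # "pre-train the model for 500K steps"
    batch_size: int = 8192              # "with a batch size of 8192"
    universal_vocab_size: int = 60_000  # "The size of the universal vocabulary K is set to 60K"


if __name__ == "__main__":
    # Usage example: instantiate and inspect the reported values.
    config = MULPretrainConfig()
    print(config)
```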