XLM-K: Improving Cross-Lingual Language Model Pre-training with Multilingual Knowledge

Authors: Xiaoze Jiang, Yaobo Liang, Weizhu Chen, Nan Duan

AAAI 2022, pp. 10840-10848 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility assessment (each entry gives the variable, the result, and the supporting LLM response):
Research Type: Experimental. LLM response: We evaluate XLM-K on MLQA, NER and XNLI. Experimental results clearly demonstrate significant improvements over existing multilingual language models. The results on MLQA and NER exhibit the superiority of XLM-K in knowledge-related tasks. The success in XNLI shows a better cross-lingual transferability obtained in XLM-K. What is more, we provide a detailed probing analysis to confirm the desired knowledge captured in our pre-training regimen.
Researcher Affiliation: Collaboration. LLM response: Xiaoze Jiang (Beihang University, Beijing, China), Yaobo Liang (Microsoft Research Asia, Beijing, China), Weizhu Chen (Microsoft Azure AI, Redmond, WA, USA), Nan Duan (Microsoft Research Asia, Beijing, China); xzjiang@buaa.edu.cn, {yalia, wzchen, nanduan}@microsoft.com
Pseudocode: No. LLM response: The paper describes the model architecture and tasks but does not include any formal pseudocode or algorithm blocks.
Open Source Code: Yes. LLM response: The code is available at https://github.com/microsoft/Unicoder/tree/master/pretraining/xlmk.
Open Datasets: Yes. LLM response: For the multilingual masked language modeling task, we use the Common Crawl dataset (Wenzek et al. 2020). The Common Crawl dataset is crawled from the whole web without restriction and contains all of the Wikipedia corpus. For the proposed two tasks, we use the corpus for the top 100 languages with the largest Wikipedias. ... MLQA (Lewis et al. 2020) is a multilingual question answering dataset... The cross-lingual NER (Liang et al. 2020) dataset covers 4 languages... XNLI (Conneau et al. 2018) is a popular evaluation dataset for cross-lingual NLI which contains 15 languages. (A hedged sketch showing one way to load these benchmarks appears after this list.)
Dataset Splits: Yes. LLM response: We test all the fine-tuned models on the dev split of all languages for each fine-tuning epoch and select the model based on the best average performance on the dev split of all languages. (See the model-selection sketch after this list.)
Hardware Specification: Yes. LLM response: The pre-training experiments are conducted using 16 V100 GPUs.
Software Dependencies: No. LLM response: The paper mentions using the Adam optimizer but does not specify versions for any software libraries or dependencies, such as PyTorch, TensorFlow, or Python.
Experiment Setup: Yes. LLM response: The architecture of XLM-K is set as follows: 768 hidden units, 12 heads, 12 layers, GELU activation, a dropout rate of 0.1, a maximal input length of 256 for the proposed knowledge tasks, and 512 for the MLM task. ... We initialize the model with XLM-R base (Conneau et al. 2020) and conduct continual pre-training with gradient accumulation up to a batch size of 8,192. We use Adam (Kingma and Ba 2015) as the optimizer. The learning rate schedule starts with 10k warm-up steps and the peak learning rate is set to 3e-5. The candidate list size is N = 32k. ... For MLQA, we fine-tune for 2 epochs with a learning rate of 3e-5 and a batch size of 12. For NER, we fine-tune for 20 epochs with a learning rate of 5e-6 and a batch size of 32. For XNLI, we fine-tune for 10 epochs with the other settings the same as for NER. (These hyperparameters are gathered into a config sketch after this list.)
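
Following up on the Open Datasets entry: the benchmarks are publicly available, and the short sketch below shows one way to pull MLQA and XNLI through the Hugging Face datasets library. The hub identifiers (facebook/mlqa, facebook/xnli) and configuration names are assumptions about the current Hub layout rather than anything stated in the paper, which works from the original releases.

# Hedged sketch: loading two of the evaluation sets with Hugging Face `datasets`.
# The hub IDs and config names below are assumptions and may need adjusting;
# the paper itself evaluates on the original MLQA, XNLI and cross-lingual NER releases.
from datasets import load_dataset

# MLQA configs follow the pattern "mlqa.<context_language>.<question_language>".
mlqa_en = load_dataset("facebook/mlqa", "mlqa.en.en")

# XNLI: "all_languages" bundles the 15 languages; per-language configs (e.g. "en") also exist.
xnli = load_dataset("facebook/xnli", "all_languages")

print(mlqa_en)                # validation and test splits
print(xnli["validation"][0])  # premise, hypothesis, label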
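
Following up on the Dataset Splits entry: the checkpoint-selection rule (evaluate on the dev split of every language after each fine-tuning epoch, keep the checkpoint with the best average) amounts to a few lines of Python. The scores used below are placeholder numbers for illustration, not results from the paper.

# Hedged sketch of the per-epoch model selection rule described in the paper:
# evaluate each fine-tuned checkpoint on the dev split of every language and
# keep the epoch with the highest average dev score.
def select_best_epoch(dev_scores):
    """dev_scores maps epoch -> {language: dev metric}."""
    def average(scores):
        return sum(scores.values()) / len(scores)
    return max(dev_scores, key=lambda epoch: average(dev_scores[epoch]))

if __name__ == "__main__":
    # Placeholder numbers, purely illustrative.
    dev_scores = {
        1: {"en": 0.78, "de": 0.71, "es": 0.73},
        2: {"en": 0.80, "de": 0.74, "es": 0.75},
        3: {"en": 0.79, "de": 0.73, "es": 0.76},
    }
    print("best epoch:", select_best_epoch(dev_scores))  # -> 2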
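
Following up on the Experiment Setup entry: the reported hyperparameters can be gathered into a single configuration sketch. The field names and the xlm-roberta-base checkpoint identifier are illustrative assumptions rather than names taken from the released code.

# Hedged sketch: pre-training and fine-tuning hyperparameters reported in the
# paper, collected into plain Python containers. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class XlmkPretrainConfig:
    hidden_size: int = 768
    num_attention_heads: int = 12
    num_layers: int = 12
    activation: str = "gelu"
    dropout: float = 0.1
    max_seq_len_knowledge_tasks: int = 256     # proposed knowledge tasks
    max_seq_len_mlm: int = 512                 # masked language modeling
    init_checkpoint: str = "xlm-roberta-base"  # XLM-R base init; hub id is an assumption
    effective_batch_size: int = 8192           # reached via gradient accumulation
    optimizer: str = "adam"
    warmup_steps: int = 10_000
    peak_learning_rate: float = 3e-5
    candidate_list_size: int = 32_000          # "N = 32k" as reported

# Per-task fine-tuning settings; XNLI reuses the NER learning rate and batch size.
FINETUNE_CONFIGS = {
    "mlqa": {"epochs": 2,  "learning_rate": 3e-5, "batch_size": 12},
    "ner":  {"epochs": 20, "learning_rate": 5e-6, "batch_size": 32},
    "xnli": {"epochs": 10, "learning_rate": 5e-6, "batch_size": 32},
}

if __name__ == "__main__":
    print(XlmkPretrainConfig())
    print(FINETUNE_CONFIGS["xnli"])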