Cross-lingual Language Model Pretraining
Authors: Alexis Conneau, Guillaume Lample
NeurIPS 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We obtain state-of-the-art results on cross-lingual classification, unsupervised and supervised machine translation. On XNLI, our approach pushes the state of the art by an absolute gain of 4.9% accuracy. On unsupervised machine translation, we obtain 34.3 BLEU on WMT'16 German-English, improving the previous state of the art by more than 9 BLEU. On supervised machine translation, we obtain a new state of the art of 38.5 BLEU on WMT'16 Romanian-English, outperforming the previous best approach by more than 4 BLEU. Our code and pretrained models are publicly available. In this section, we empirically demonstrate the strong impact of cross-lingual language model pretraining on several benchmarks, and compare our approach to the current state of the art. |
| Researcher Affiliation | Collaboration | Alexis Conneau, Facebook AI Research / Université Le Mans, aconneau@fb.com; Guillaume Lample, Facebook AI Research / Sorbonne Universités, glample@fb.com |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks (clearly labeled algorithm sections or code-like formatted procedures). |
| Open Source Code | Yes | Our code and pretrained models are publicly available (footnote 1: https://github.com/facebookresearch/XLM). |
| Open Datasets | Yes | We use WikiExtractor to extract raw sentences from Wikipedia dumps and use them as monolingual data for the CLM and MLM objectives. For the TLM objective, we only use parallel data that involves English, similar to Conneau et al. [12]. Precisely, we use MultiUN [44] for French, Spanish, Russian, Arabic and Chinese, and the IIT Bombay corpus [3] for Hindi. We extract the following corpora from the OPUS website (Tiedemann [37]): the EUbookshop corpus for German, Greek and Bulgarian, OpenSubtitles 2018 for Turkish, Vietnamese and Thai, Tanzil for both Urdu and Swahili, and GlobalVoices for Swahili. (A hedged extraction sketch follows the table.) |
| Dataset Splits | No | When fine-tuning on XNLI, we use mini-batches of size 8 or 16, and we clip the sentence length to 256 words. We use 80k BPE splits and a vocabulary of 95k and train a 12-layer model on the Wikipedias of the XNLI languages. We sample the learning rate of the Adam optimizer with values from 5·10⁻⁴ to 2·10⁻⁴, and use small evaluation epochs of 20,000 random samples. The paper refers to training, validation, and test data (implicitly, via the 'small evaluation epochs' and the XNLI dev and test sets used for the baselines), but it does not report explicit train/validation/test splits for its own models, such as percentages or exact counts. (A hedged fine-tuning configuration sketch follows the table.) |
| Hardware Specification | Yes | We implement all our models in PyTorch [29], and train them on 64 Volta GPUs for the language modeling tasks, and 8 GPUs for the MT tasks. We use float16 operations to speed up training and to reduce the memory usage of our models. (A hedged mixed-precision training sketch follows the table.) |
| Software Dependencies | No | We implement all our models in PyTorch [29]... The paper names PyTorch, but gives no version numbers for it or for any other software dependency, which reproducibility would require. |
| Experiment Setup | Yes | In all experiments, we use a Transformer architecture with 1024 hidden units, 8 heads, GELU activations [17], a dropout rate of 0.1 and learned positional embeddings. We train our models with the Adam optimizer [23], a linear warm-up [38] and learning rates varying from 10⁻⁴ to 5·10⁻⁴. For the CLM and MLM objectives, we use streams of 256 tokens and mini-batches of size 64. For the TLM objective, we sample mini-batches of 4000 tokens composed of sentences with similar lengths. When fine-tuning on XNLI, we use mini-batches of size 8 or 16, and we clip the sentence length to 256 words. We use 80k BPE splits and a vocabulary of 95k. (A hedged PyTorch configuration sketch follows the table.) |
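The Open Datasets row describes extracting raw sentences from Wikipedia dumps with WikiExtractor. Below is a minimal sketch, assuming WikiExtractor has been run with its `--json` option so that each output line is a JSON object with a `text` field; the directory layout, output paths, and the naive sentence-splitting rule are illustrative assumptions, not details from the paper.

```python
import json
import re
from pathlib import Path

def iter_wiki_sentences(extracted_dir):
    """Yield raw sentences from WikiExtractor --json output files.

    Assumes each line of each output file is a JSON object with a "text"
    field, and uses a naive end-of-sentence split; the paper does not
    specify its exact sentence segmentation.
    """
    for path in Path(extracted_dir).rglob("wiki_*"):
        with open(path, encoding="utf-8") as f:
            for line in f:
                text = json.loads(line)["text"]
                for sentence in re.split(r"(?<=[.!?])\s+", text):
                    sentence = sentence.strip()
                    if sentence:
                        yield sentence

# Example: write monolingual sentences for one language (paths are hypothetical).
if __name__ == "__main__":
    with open("data/en.all", "w", encoding="utf-8") as out:
        for sent in iter_wiki_sentences("extracted/enwiki"):
            out.write(sent + "\n")
```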
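The Dataset Splits row quotes the XNLI fine-tuning choices: mini-batches of size 8 or 16, sentences clipped to 256 words, an Adam learning rate sampled between 2·10⁻⁴ and 5·10⁻⁴, and small evaluation epochs of 20,000 samples. A minimal sketch of that kind of hyperparameter sampling follows; the function and key names are illustrative, not from the paper.

```python
import random

def sample_xnli_finetune_config(seed=None):
    """Hedged sketch of the quoted XNLI fine-tuning hyperparameter choices."""
    rng = random.Random(seed)
    return {
        "batch_size": rng.choice([8, 16]),   # mini-batches of size 8 or 16
        "max_len": 256,                      # clip sentences to 256 words
        "optimizer": "adam",
        "lr": rng.uniform(2e-4, 5e-4),       # sampled Adam learning rate
        "eval_epoch_size": 20000,            # small evaluation epochs of 20k samples
    }

print(sample_xnli_finetune_config(seed=0))
```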
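The Hardware Specification row notes that float16 operations were used to speed up training and cut memory. A minimal sketch of one way to do this in current PyTorch with `torch.cuda.amp`; the paper predates this API, so this is an assumed modern stand-in, and `model`, `batch`, and `loss_fn` are placeholders.

```python
import torch

def train_step(model, batch, loss_fn, optimizer, scaler):
    """One mixed-precision training step (hedged sketch, not the paper's code)."""
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():          # run the forward pass in a float16/float32 mix
        output = model(batch["x"])
        loss = loss_fn(output, batch["y"])
    scaler.scale(loss).backward()            # scale the loss to avoid fp16 underflow
    scaler.step(optimizer)
    scaler.update()
    return loss.item()

# Typical setup (single GPU shown; the paper used 64 Volta GPUs for the LM tasks):
# model = ...
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
# scaler = torch.cuda.amp.GradScaler()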
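The Experiment Setup row specifies the architecture and optimization hyperparameters. A minimal configuration sketch follows, using `torch.nn.TransformerEncoder` as a stand-in for the paper's own Transformer implementation. The 1024 hidden units, 8 heads, GELU activations, dropout 0.1, learned positional embeddings, 12 layers (quoted for the XNLI model), 256-token streams, 95k vocabulary, and Adam with linear warm-up come from the quoted text; the 4×d_model feed-forward size, warm-up length, and all names are assumptions.

```python
import torch
import torch.nn as nn

class XLMStyleEncoder(nn.Module):
    """Hedged sketch of the described setup, not the authors' implementation."""

    def __init__(self, vocab_size=95000, d_model=1024, n_heads=8,
                 n_layers=12, dropout=0.1, max_len=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)    # learned positional embeddings
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads,
            dim_feedforward=4 * d_model,                 # assumption: 4x hidden size, not quoted
            dropout=dropout, activation="gelu", batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, tokens):                           # tokens: (batch, seq_len)
        positions = torch.arange(tokens.size(1), device=tokens.device)
        h = self.tok_emb(tokens) + self.pos_emb(positions)
        return self.encoder(h)

model = XLMStyleEncoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # quoted range: 1e-4 to 5e-4

warmup_steps = 4000  # placeholder; the paper only says "linear warm-up"
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lambda step: min(1.0, (step + 1) / warmup_steps))
```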