Alternating Language Modeling for Cross-Lingual Pre-Training
Authors: Jian Yang, Shuming Ma, Dongdong Zhang, Shuangzhi Wu, Zhoujun Li, Ming Zhou
AAAI 2020, pp. 9386–9393
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate our pre-trained ALM on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform previous pre-training methods on three benchmarks. |
| Researcher Affiliation | Collaboration | Jian Yang,1 Shuming Ma,2 Dongdong Zhang,2 Shuangzhi Wu,3 Zhoujun Li,1 Ming Zhou2 (1State Key Lab of Software Development Environment, Beihang University; 2Microsoft Research Asia; 3SPPD of Tencent Inc.) |
| Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks. |
| Open Source Code | Yes | Code can be found at https://github.com/zddfunseeker/ALM. |
| Open Datasets | Yes | Following previous work (Lample and Conneau 2019), we use Wikipedia data extracted with WikiExtractor and WMT data as monolingual data. For bilingual data, French, Spanish, Russian, Arabic, and Chinese data are from MultiUN (Ziemski, Junczys-Dowmunt, and Pouliquen 2016). Hindi data is from the IIT Bombay corpus (Kunchukuttan, Mehta, and Bhattacharyya 2018). German and Greek are from the EUbookshop corpus. Turkish, Vietnamese, and Thai are from OpenSubtitles 2018. Urdu and Swahili data are from Tanzil, and Swahili data is also drawn from Global Voices. For most languages, we use the tokenizer provided by Moses (Koehn et al. 2007); a preprocessing sketch follows the table. |
| Dataset Splits | Yes | The WMT14 English-German machine translation dataset has 4.5 million sentence pairs for training; newsdev2014 is used as the validation set, and newstest2014 as the test set. The IWSLT14 German-English machine translation dataset contains 160 thousand sentence pairs collected from TED talks; the iwslt14 dev set is used as the validation set and the iwslt14 test set as the test set. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments. |
| Software Dependencies | No | The paper mentions techniques and algorithms like 'byte pair encoding (BPE)' and 'Adam optimizer', but it does not specify any software libraries or frameworks with version numbers (e.g., TensorFlow 2.x, PyTorch 1.x) that were used. |
| Experiment Setup | Yes | We pre-train our model with 1024 embedding and hidden units, 8 heads, a dropout rate of 0.1, and learned positional embeddings. We use the Adam optimizer with β1 = 0.9 and β2 = 0.98, and an inverse-sqrt learning rate schedule with a linear warmup of 4000 steps and a peak learning rate of 0.0005. We tune the learning rates on the validation set; they are 5 × 10⁻⁴ for IWSLT14 German-English and 10⁻³ for WMT14 English-German. The batch size is set to 8192 tokens for all experiments. During decoding, we set the beam size to 8. (A sketch of this optimizer and schedule follows the table.) |
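
The inverse-sqrt schedule with linear warmup quoted in the Experiment Setup row is the standard Transformer schedule. The snippet below is a minimal sketch of how it pairs with the reported Adam settings (β1 = 0.9, β2 = 0.98, peak learning rate 5 × 10⁻⁴, 4000 warmup steps); the function name `inverse_sqrt_lr`, the stand-in model, and the bare training loop are illustrative assumptions, not the authors' released code.

```python
import torch

# Minimal sketch (not the paper's code): inverse-sqrt learning-rate schedule
# with linear warmup, using the hyperparameters quoted in the table above.
def inverse_sqrt_lr(step, peak_lr=5e-4, warmup_steps=4000):
    """Linear warmup to peak_lr, then decay proportional to 1/sqrt(step)."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps                  # linear warmup
    return peak_lr * (warmup_steps ** 0.5) * (step ** -0.5)   # inverse-sqrt decay

# Stand-in module; the paper trains a Transformer with 1024 embedding/hidden units.
model = torch.nn.Linear(1024, 1024)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

for step in range(1, 20001):
    lr = inverse_sqrt_lr(step)
    for group in optimizer.param_groups:
        group["lr"] = lr
    # ... forward pass on an 8192-token batch, loss.backward(),
    #     optimizer.step(), optimizer.zero_grad() would go here ...
```

At step 4000 the two branches meet at the peak rate, after which the rate decays as 1/sqrt(step), matching the "inverse sqrt learning rate schedule with a linear warmup" described in the paper.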
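
For the Moses tokenization and BPE mentioned in the Open Datasets and Software Dependencies rows, the paper names the tools but not the scripts or versions. The sketch below uses the sacremoses Python port of the Moses tokenizer and subword-nmt as plausible stand-ins; the German example sentences, the merge count, and all variable names are illustrative assumptions.

```python
import io
from sacremoses import MosesTokenizer          # Python port of the Moses tokenizer scripts
from subword_nmt.learn_bpe import learn_bpe    # byte pair encoding (Sennrich et al.)
from subword_nmt.apply_bpe import BPE

# Tiny illustrative corpus; the paper uses Wikipedia/WMT monolingual data instead.
corpus = [
    "Wir danken Ihnen für Ihre Aufmerksamkeit.",
    "Vielen Dank für die Einladung.",
]

# 1) Moses-style tokenization.
mt = MosesTokenizer(lang="de")
tokenized = [mt.tokenize(line, return_str=True) for line in corpus]

# 2) Learn a small BPE code table in memory (the merge count here is arbitrary),
#    then segment the tokenized text with it.
codes = io.StringIO()
learn_bpe(tokenized, codes, num_symbols=100)
codes.seek(0)
bpe = BPE(codes)

for line in tokenized:
    print(bpe.process_line(line))   # prints subword-segmented text with "@@" continuation markers
```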