Alternating Language Modeling for Cross-Lingual Pre-Training

Authors: Jian Yang, Shuming Ma, Dongdong Zhang, ShuangZhi Wu, Zhoujun Li, Ming Zhou

AAAI 2020, pp. 9386-9393 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate our ALM pre-training method on the downstream tasks of machine translation and cross-lingual classification. Experiments show that ALM can outperform the previous pre-training methods on three benchmarks.
Researcher Affiliation | Collaboration | Jian Yang (1), Shuming Ma (2), Dongdong Zhang (2), ShuangZhi Wu (3), Zhoujun Li (1), Ming Zhou (2). (1) State Key Lab of Software Development Environment, Beihang University; (2) Microsoft Research Asia; (3) SPPD of Tencent Inc.
Pseudocode | No | The paper does not contain any pseudocode or clearly labeled algorithm blocks.
Open Source Code | Yes | Code can be found at https://github.com/zddfunseeker/ALM.
Open Datasets | Yes | Following previous work (Lample and Conneau 2019), we use Wikipedia data extracted with WikiExtractor and WMT data as monolingual data. For bilingual data, French, Spanish, Russian, Arabic, and Chinese data are from MultiUN (Ziemski, Junczys-Dowmunt, and Pouliquen 2016). Hindi data is from the IIT Bombay corpus (Kunchukuttan, Mehta, and Bhattacharyya 2018). German and Greek data are from the EUbookshop corpus. Turkish, Vietnamese, and Thai data are from OpenSubtitles 2018. Urdu and Swahili data are from Tanzil, and additional Swahili data comes from Global Voices. For most languages, we use the tokenizer provided by Moses (Koehn et al. 2007). A hedged preprocessing sketch follows the table.
Dataset Splits | Yes | The WMT14 English-German machine translation dataset has 4.5 million sentence pairs for training; newsdev2014 is used as the validation set, while newstest2014 is the testing set. The IWSLT14 German-English machine translation dataset contains 160 thousand sentence pairs collected from TED talks; we use the iwslt14 devset as the validation set and the iwslt14 testset as the testing set.
Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory) used for running its experiments.
Software Dependencies | No | The paper mentions techniques and algorithms like byte pair encoding (BPE) and the Adam optimizer, but it does not specify any software libraries or frameworks with version numbers (e.g., TensorFlow 2.x, PyTorch 1.x) that were used.
Experiment Setup | Yes | We pre-train our model with 1024-dimensional embeddings and hidden units, 8 attention heads, a dropout rate of 0.1, and learned positional embeddings. We use the Adam optimizer with β1 = 0.9 and β2 = 0.98, and an inverse-sqrt learning rate schedule with a linear warmup of 4000 steps and a learning rate of 0.0005. We tune the learning rates based on performance on the validation set: 5 × 10^-4 for IWSLT14 German-English and 10^-3 for WMT14 English-German. The batch size is set to 8192 tokens for all experiments. During decoding, we set the beam size to 8. A hedged optimizer and schedule sketch follows the table.
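
The Moses tokenization and BPE mentioned in the Open Datasets and Software Dependencies rows are not tied to specific packages in the paper. As a rough preprocessing sketch, one could use the sacremoses and subword-nmt packages; these package choices, the toy sentences, and the merge count are assumptions for illustration, not details from the paper.

```python
# Hypothetical preprocessing sketch: Moses-style tokenization followed by BPE.
# sacremoses and subword-nmt are stand-ins chosen here; the paper only names
# the Moses tokenizer and byte pair encoding, not these libraries.
from io import StringIO

from sacremoses import MosesTokenizer
from subword_nmt.learn_bpe import learn_bpe
from subword_nmt.apply_bpe import BPE

raw_sentences = [
    "Alternating language modeling mixes source and target phrases.",
    "Cross-lingual pre-training helps low-resource translation.",
]

# 1) Moses tokenization (use the appropriate language code per corpus).
mt = MosesTokenizer(lang="en")
tokenized = [mt.tokenize(s, return_str=True) for s in raw_sentences]

# 2) Learn a tiny BPE code table on the tokenized text (toy merge count;
#    the paper's actual BPE vocabulary size is not stated in this summary).
codes = StringIO()
learn_bpe(StringIO("\n".join(tokenized)), codes, num_symbols=100)
codes.seek(0)

# 3) Apply the learned merges to each tokenized sentence.
bpe = BPE(codes)
segmented = [bpe.process_line(s) for s in tokenized]
print(segmented[0])
```

In a real pipeline the BPE codes would be learned once on the concatenated training corpora and then applied consistently to the training, validation, and test splits.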
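
The Experiment Setup row describes the optimizer and learning-rate schedule but not an implementation. Below is a minimal PyTorch sketch, assuming a LambdaLR-based inverse-sqrt schedule; the placeholder Linear layer and the toy training loop are illustrative only and are not the paper's ALM Transformer.

```python
# Minimal sketch of the reported recipe: Adam with betas (0.9, 0.98) and an
# inverse-sqrt learning-rate schedule with a 4000-step linear warmup.
import torch

WARMUP_STEPS = 4000
PEAK_LR = 5e-4  # paper reports 5e-4 for IWSLT14 De-En and 1e-3 for WMT14 En-De

model = torch.nn.Linear(1024, 1024)  # placeholder for the 1024-d, 8-head Transformer
optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=(0.9, 0.98))

def inverse_sqrt(step: int) -> float:
    """Scale factor: linear warmup to 1.0, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return (WARMUP_STEPS ** 0.5) * (step ** -0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

for step in range(10):  # toy loop; real batches would hold roughly 8192 tokens
    optimizer.zero_grad()
    loss = model(torch.randn(2, 1024)).pow(2).mean()
    loss.backward()
    optimizer.step()
    scheduler.step()
```

With this factor the learning rate peaks at 5e-4 around step 4000 and then decays as sqrt(4000/step); dropout (0.1) and beam-size-8 decoding belong to the model and inference code, which are not sketched here.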