switch-GLAT: Multilingual Parallel Machine Translation Via Code-Switch Decoder

Authors: Zhenqiao Song, Hao Zhou, Lihua Qian, Jingjing Xu, Shanbo Cheng, Mingxuan Wang, Lei Li

ICLR 2022

| Reproducibility Variable | Result | LLM Response |
| --- | --- | --- |
| Research Type | Experimental | "Experiments show that our proposed switch-GLAT outperform the multilingual Transformer with as much as 0.74 BLEU improvement and 6.2x faster decoding speed in inference. We conduct extensive experiments on 3 merged translation datasets." |
| Researcher Affiliation | Collaboration | 1) ByteDance AI Lab, Shanghai, China; 2) University of California, Santa Barbara. Emails: {songzhenqiao,zhouhao.nlp,qianlihua}@bytedance.com, lilei@ucsb.edu, {chengshanbo,wangmingxuan.89}@bytedance.com, jingjingxu@pku.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks (e.g., a figure or section explicitly labeled "Pseudocode" or "Algorithm"). |
| Open Source Code | No | The paper does not include any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | "WMT-EDF: We collect 4 language pairs from WMT-14 English (En)↔German (De) and English (En)↔French (Fr). WMT-EFZ: We also collect 4 language pairs from WMT-14 English (En)↔French (Fr) and WMT-17 English (En)↔Chinese (Zh). WMT-many: We also gather 10 language pairs from WMT-14 English (En)↔German (De), English (En)↔French (Fr), WMT-16 English (En)↔Russian (Ru) and WMT-17 English (En)↔Chinese (Zh) to test switch-GLAT on more diverse language pairs." |
| Dataset Splits | Yes | "We conduct extensive experiments on 3 merged translation datasets: WMT with four language pairs (both close languages and distant ones) and WMT with 10 language pairs. We collect 4 language pairs from WMT-14 English (En)↔German (De) and English (En)↔French (Fr)." |
| Hardware Specification | Yes | "The model is trained with 8 NVIDIA Tesla V100 GPU cards." |
| Software Dependencies | No | The paper mentions software components like the Adam optimizer and BPE encodings and refers to their original papers, but it does not specify version numbers for these software packages or any other programming languages/libraries used (e.g., Python, PyTorch). |
| Experiment Setup | Yes | "We use 6 layers for encoder and parallel decoder. The model hidden size d_model and feed-forward hidden size d_ff are set to 512 and 2048 respectively. The number of attention head is set to 8. The vocabulary size is set to 85k for WMT-EDF and 95k for WMT-EFZ/many. The changing point E is set to 300,000 steps and sampling number S is set to 300,000 for each pair. The mini-batch size is set to 64k tokens and the maximum training step is 1,200,000. We follow the default parameters of Adam optimizer (Kingma & Ba, 2014) and learning rate schedule in Vaswani et al. (2017). Dropout annealing strategy (Rennie et al., 2015) is applied to stable training and the initialized dropout rate is set to 0.3. In training, data from different language pairs are sampled according to a multinomial distribution rebalanced by a temperature of 0.3 (Conneau et al., 2019)." A sketch of this temperature-rebalanced sampling follows the table. |
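The Experiment Setup row quotes a multinomial sampling distribution over language pairs "rebalanced by a temperature of 0.3 (Conneau et al., 2019)". Below is a minimal sketch of how such a distribution can be computed, assuming the Conneau et al. (2019) convention q_i ∝ p_i^α with the reported 0.3 read as the exponent α; the function name and the corpus sizes are illustrative, not values from the paper.

```python
import numpy as np

def rebalanced_sampling_probs(pair_sizes, alpha=0.3):
    """Temperature-rebalanced multinomial over language pairs.

    With p_i the raw fraction of the merged corpus belonging to pair i,
    pair i is sampled with probability q_i = p_i**alpha / sum_j p_j**alpha
    (Conneau et al., 2019). Smaller alpha flattens the distribution and
    up-samples low-resource pairs. Some works instead parameterize the
    exponent as 1/T; treating the paper's "temperature of 0.3" as alpha
    here is an assumption.
    """
    sizes = np.asarray(list(pair_sizes.values()), dtype=np.float64)
    p = sizes / sizes.sum()          # raw corpus shares
    q = p ** alpha                   # rebalance
    q /= q.sum()                     # renormalize to a distribution
    return dict(zip(pair_sizes.keys(), q))

# Made-up corpus sizes (sentence pairs), for illustration only:
pair_sizes = {"en-de": 4_500_000, "en-fr": 36_000_000,
              "en-zh": 20_000_000, "en-ru": 2_500_000}
probs = rebalanced_sampling_probs(pair_sizes, alpha=0.3)
for pair, q in probs.items():
    raw = pair_sizes[pair] / sum(pair_sizes.values())
    print(f"{pair}: raw share {raw:.3f} -> sampling prob {q:.3f}")

# Drawing the language pair for one mini-batch:
rng = np.random.default_rng(0)
pairs = list(probs)
batch_pair = pairs[rng.choice(len(pairs), p=list(probs.values()))]
```

With α = 0.3 the high-resource pairs (e.g., the En-Fr share above) are down-weighted and the low-resource pairs up-weighted relative to their raw corpus shares, which is the stated purpose of the rebalancing.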