switch-GLAT: Multilingual Parallel Machine Translation Via Code-Switch Decoder
Authors: Zhenqiao Song, Hao Zhou, Lihua Qian, Jingjing Xu, Shanbo Cheng, Mingxuan Wang, Lei Li
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments show that our proposed switch-GLAT outperforms the multilingual Transformer with as much as 0.74 BLEU improvement and 6.2x faster decoding speed in inference. We conduct extensive experiments on 3 merged translation datasets. |
| Researcher Affiliation | Collaboration | ¹ByteDance AI Lab, Shanghai, China; ²University of California, Santa Barbara. {songzhenqiao,zhouhao.nlp,qianlihua}@bytedance.com, lilei@ucsb.edu, {chengshanbo,wangmingxuan.89}@bytedance.com, jingjingxu@pku.edu.cn |
| Pseudocode | No | The paper does not contain any structured pseudocode or algorithm blocks (e.g., a figure or section explicitly labeled 'Pseudocode' or 'Algorithm'). |
| Open Source Code | No | The paper does not include any explicit statement or link indicating that the source code for the described methodology is publicly available. |
| Open Datasets | Yes | WMT-EDF: We collect 4 language pairs from WMT-14 English (En) ↔ German (De) and English (En) ↔ French (Fr). WMT-EFZ: We also collect 4 language pairs from WMT-14 English (En) ↔ French (Fr) and WMT-17 English (En) ↔ Chinese (Zh). WMT-many: We also gather 10 language pairs from WMT-14 English (En) ↔ German (De), English (En) ↔ French (Fr), WMT-16 English (En) ↔ Russian (Ru) and WMT-17 English (En) ↔ Chinese (Zh) to test switch-GLAT on more diverse language pairs. |
| Dataset Splits | Yes | We conduct extensive experiments on 3 merged translation datasets: WMT-EDF and WMT-EFZ with four language pairs each (both close languages and distant ones) and WMT-many with 10 language pairs. We collect 4 language pairs from WMT-14 English (En) ↔ German (De) and English (En) ↔ French (Fr). |
| Hardware Specification | Yes | The model is trained with 8 NVIDIA Tesla V100 GPU cards. |
| Software Dependencies | No | The paper mentions software components like 'Adam optimizer' and 'BPE encodings' and refers to their original papers, but it does not specify version numbers for these software packages or any other programming languages/libraries used (e.g., Python, PyTorch). |
| Experiment Setup | Yes | We use 6 layers for the encoder and the parallel decoder. The model hidden size d_model and feed-forward hidden size d_ff are set to 512 and 2048 respectively. The number of attention heads is set to 8. The vocabulary size is set to 85k for WMT-EDF and 95k for WMT-EFZ/many. The changing point E is set to 300,000 steps and the sampling number S is set to 300,000 for each pair. The mini-batch size is set to 64k tokens and the maximum training step is 1,200,000. We follow the default parameters of the Adam optimizer (Kingma & Ba, 2014) and the learning rate schedule in Vaswani et al. (2017). A dropout annealing strategy (Rennie et al., 2015) is applied to stabilize training and the initial dropout rate is set to 0.3. In training, data from different language pairs are sampled according to a multinomial distribution rebalanced by a temperature of 0.3 (Conneau et al., 2019); a minimal sketch of this sampling scheme follows the table. |
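
The temperature-rebalanced sampling cited in the Experiment Setup row follows Conneau et al. (2019): each language pair is drawn with probability proportional to its share of the merged corpus raised to the temperature (0.3 here), which upsamples lower-resource pairs. The sketch below is a minimal illustration under that assumption, not the authors' code; the `rebalanced_sampling_probs` helper name and the corpus sizes are hypothetical placeholders.

```python
import random

def rebalanced_sampling_probs(pair_sizes, temperature=0.3):
    """Multinomial sampling probabilities rebalanced by a temperature,
    as in Conneau et al. (2019): p_i is proportional to (n_i / N) ** temperature."""
    total = sum(pair_sizes.values())
    weights = {pair: (n / total) ** temperature for pair, n in pair_sizes.items()}
    norm = sum(weights.values())
    return {pair: w / norm for pair, w in weights.items()}

# Hypothetical corpus sizes (sentence pairs) for a WMT-EDF-style setup;
# the real counts are not given in the quoted snippet.
sizes = {"en-de": 4_500_000, "de-en": 4_500_000,
         "en-fr": 36_000_000, "fr-en": 36_000_000}
probs = rebalanced_sampling_probs(sizes, temperature=0.3)

# Pick the language pair for the next mini-batch of (up to) 64k tokens.
next_pair = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
print(probs, next_pair)
```

Under these placeholder sizes, each En↔De direction moves from roughly a 6% raw share of the merged data to roughly a 17% sampling probability, which is the rebalancing effect the temperature of 0.3 is meant to provide.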