High-resource Language-specific Training for Multilingual Neural Machine Translation

Authors: Jian Yang, Yuwei Yin, Shuming Ma, Dongdong Zhang, Zhoujun Li, Furu Wei

IJCAI 2022 | Conference PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results show that HLT-MT outperforms various strong baselines on WMT-10 and OPUS-100 benchmarks. Furthermore, the analytic experiments validate the effectiveness of our method in mitigating the negative interference in multilingual training.
Researcher Affiliation | Collaboration | Jian Yang¹, Yuwei Yin²*, Shuming Ma², Dongdong Zhang², Zhoujun Li¹, Furu Wei²; ¹State Key Lab of Software Development Environment, Beihang University; ²Microsoft Research
Pseudocode | No | The paper does not contain any pseudocode or algorithm blocks.
Open Source Code | No | The paper does not contain any explicit statements or links indicating that the source code for the described methodology is open-source or publicly available.
Open Datasets | Yes | To evaluate our method, we conduct experiments on the WMT-10 and OPUS-100 datasets. WMT-10: We use a collection of parallel data in different languages from the WMT datasets to evaluate the models [Wang et al., 2020a]. The parallel data is between English and 10 other languages: French (Fr), Czech (Cs), German (De), Finnish (Fi), Latvian (Lv), Estonian (Et), Romanian (Ro), Hindi (Hi), Turkish (Tr), and Gujarati (Gu). OPUS-100: We use the OPUS-100 corpus [Zhang et al., 2020] for massively multilingual machine translation. (A minimal dataset-loading sketch follows the table.)
Dataset Splits | No | The paper mentions training on datasets but does not explicitly detail the exact training, validation, and test splits with specific percentages or counts.
Hardware Specification | Yes | The batch size is 4096 tokens on 64 Tesla V100 GPUs.
Software Dependencies | No | The paper mentions the Adam optimizer and a Transformer backbone, but does not provide specific version numbers for these or any other software dependencies.
Experiment Setup | Yes | We adopt the Transformer as the backbone model for all experiments. We train the multilingual models with Adam (β1 = 0.9, β2 = 0.98). The learning rate is set to 5e-4 with 4,000 warm-up steps. The models are trained with label-smoothed cross-entropy using a smoothing ratio of 0.1. The batch size is 4096 tokens on 64 Tesla V100 GPUs. For WMT-10, we first train the multilingual model on 6 languages and then finetune on all languages. For OPUS-100, the model is trained on the languages where the number of pairs exceeds 10K.
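
The Open Datasets row reports that WMT-10 and OPUS-100 are publicly available, but the paper gives no download or preprocessing commands. As an illustration only, the sketch below loads one OPUS-100 language pair through the Hugging Face datasets library; the repository id "Helsinki-NLP/opus-100", the "en-fr" config name, and the split names are assumptions about that public mirror, not details taken from the paper.

    # Minimal sketch: fetch one OPUS-100 language pair for inspection.
    # Assumptions (not from the paper): the corpus is mirrored on the Hugging
    # Face Hub as "Helsinki-NLP/opus-100" with per-pair configs such as "en-fr"
    # and the usual train/validation/test splits.
    from datasets import load_dataset

    opus_en_fr = load_dataset("Helsinki-NLP/opus-100", "en-fr")

    # Each example is a {"translation": {"en": ..., "fr": ...}} record.
    print(opus_en_fr["train"][0]["translation"])
    print({split: len(ds) for split, ds in opus_en_fr.items()})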
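
The Experiment Setup row lists concrete optimization hyperparameters but no code. Below is a minimal PyTorch sketch, not the authors' implementation, that wires the reported values together (Adam with β1 = 0.9, β2 = 0.98, peak learning rate 5e-4, 4,000 warm-up steps, label smoothing ratio 0.1); the Transformer layer sizes and the inverse-square-root schedule are assumptions, since the quoted passage does not specify them.

    # Sketch of the reported optimization setup; assumed details are noted inline.
    import torch
    import torch.nn as nn

    WARMUP_STEPS = 4_000   # from the paper
    PEAK_LR = 5e-4         # from the paper

    # Transformer backbone; the layer/width settings here are assumptions.
    model = nn.Transformer(d_model=512, nhead=8,
                           num_encoder_layers=6, num_decoder_layers=6)

    # Adam with the betas reported in the paper.
    optimizer = torch.optim.Adam(model.parameters(), lr=PEAK_LR, betas=(0.9, 0.98))

    # Inverse-square-root warm-up schedule (a common choice; assumed, not stated).
    def inverse_sqrt(step: int) -> float:
        step = max(step, 1)
        if step < WARMUP_STEPS:
            return step / WARMUP_STEPS
        return (WARMUP_STEPS / step) ** 0.5

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=inverse_sqrt)

    # Label-smoothed cross-entropy with the 0.1 smoothing ratio from the paper.
    criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

Under this schedule the learning rate scales linearly up to 5e-4 over the first 4,000 steps and decays proportionally to the inverse square root of the step count afterwards.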