Sparse MoE with Language Guided Routing for Multilingual Machine Translation

Authors: Xinyu Zhao, Xuxi Chen, Yu Cheng, Tianlong Chen

ICLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Sufficient experimental studies on MMT benchmarks with {16, 50, 100} languages and various network architectures consistently validate the superior performance of our proposals.
Researcher Affiliation | Academia | 1 The University of North Carolina at Chapel Hill; 2 The University of Texas at Austin; 3 The Chinese University of Hong Kong; 4 MIT; 5 Harvard University
Pseudocode | Yes | Algorithm 1: DEA in our proposed Lingual-SMoE.
Open Source Code | Yes | Our code is provided at https://github.com/UNITES-Lab/Lingual-SMoE.
Open Datasets | Yes | We evaluate the proposed Lingual-SMoE on the representative multilingual neural machine translation dataset, i.e., OPUS-100 (Zhang et al., 2020) that contains 100 languages and 94 validation and test language pairs.
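For quick inspection, OPUS-100 is also mirrored on the Hugging Face Hub; a minimal sketch assuming the `Helsinki-NLP/opus-100` dataset id and the `datasets` library (this is not the Fairseq preprocessing pipeline the paper uses):

```python
# Sketch: peek at one OPUS-100 language pair via the Hugging Face Hub mirror.
# Assumption: the "Helsinki-NLP/opus-100" dataset id; the paper itself prepares data with Fairseq.
from datasets import load_dataset

pair = load_dataset("Helsinki-NLP/opus-100", "de-en")
print({split: len(pair[split]) for split in pair})    # train / validation / test sizes
print(pair["validation"][0]["translation"])           # {'de': '...', 'en': '...'}
```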
Dataset Splits | Yes | We evaluate the proposed Lingual-SMoE on the representative multilingual neural machine translation dataset, i.e., OPUS-100 (Zhang et al., 2020) that contains 100 languages and 94 validation and test language pairs. (...) We split the 94 validation language pairs in OPUS-100 into three groups based on their training data size: high-resource (> 0.9M, 45 languages), low-resource (< 0.1M, 26 languages), and medium-resource (other, 28 languages) (Zhang et al., 2020). (...) Table A6: The statistics of the OPUS-100 dataset and its sub-datasets (Train / Validation / Test).
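The high/medium/low-resource grouping above is a simple threshold on per-language training size; a small sketch, with hypothetical sentence counts used purely for illustration:

```python
# Resource-level grouping as described above:
# > 0.9M training pairs -> high-resource, < 0.1M -> low-resource, otherwise medium-resource.
def resource_level(num_train_pairs: int) -> str:
    if num_train_pairs > 900_000:
        return "high"
    if num_train_pairs < 100_000:
        return "low"
    return "medium"

# Hypothetical per-language training sizes (not the released OPUS-100 statistics).
sizes = {"en-de": 1_000_000, "en-is": 500_000, "en-yi": 15_000}
print({pair: resource_level(n) for pair, n in sizes.items()})
# -> {'en-de': 'high', 'en-is': 'medium', 'en-yi': 'low'}
```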
Hardware Specification | Yes | Experiments are conducted using Fairseq (Ott et al., 2019) with 8 RTX A6000 GPUs.
Software Dependencies | Yes | BLEU Signature: nrefs:1 | case:mixed | eff:no | tok:13a | smooth:exp | version:2.3.1
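The reported signature matches SacreBLEU's defaults (single reference, mixed case, 13a tokenizer, exponential smoothing); a minimal sketch of reproducing it with the `sacrebleu` Python API, using placeholder strings:

```python
# Sketch: compute BLEU with the settings from the reported signature.
from sacrebleu.metrics import BLEU

hyps = ["the cat sat on the mat"]             # placeholder system outputs
refs = [["the cat is sitting on the mat"]]    # one reference stream, aligned with hyps
bleu = BLEU(tokenize="13a", smooth_method="exp")
print(bleu.corpus_score(hyps, refs))          # e.g. "BLEU = ..."
print(bleu.get_signature())                   # nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:...
```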
Experiment Setup | Yes | The training processes have 35K, 100K, and 200K iterations for OPUS-16, OPUS-50, and OPUS-100, respectively. With a learning rate of 5e-4, we optimize models with Adam using (β1, β2, ϵ) = (0.9, 0.98, 10⁻⁸) (Kingma & Ba, 2015). The learning-rate schedule follows the inverse square root rule with 4,000 warm-up steps. A temperature-based data sampling strategy is used to train our models (Aharoni et al., 2019). The temperature is set to 1.5 for OPUS-16 and 5 for OPUS-50 and OPUS-100. The dynamic expert allocation uses n = 5,000 iterations for OPUS-16 and OPUS-50, and 10,000 iterations for OPUS-100. In addition, the ratio of expert-number exploring updates is set to 0.8, and the threshold λ controlling expert capacity is 0.1 for OPUS-16 and OPUS-50, and 0.01 for OPUS-100.
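To make the schedule and sampling concrete, a short sketch of the inverse-square-root learning-rate rule (as in Fairseq's `inverse_sqrt` scheduler, assuming warm-up from zero) and the temperature-based sampling probabilities of Aharoni et al. (2019); the language-pair counts below are hypothetical:

```python
import math

def inverse_sqrt_lr(step: int, base_lr: float = 5e-4, warmup: int = 4000) -> float:
    """Linear warm-up to base_lr, then decay proportional to step ** -0.5."""
    if step <= warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup / step)

def temperature_sampling(sizes: dict[str, int], T: float = 5.0) -> dict[str, float]:
    """p_i proportional to (n_i / sum_j n_j) ** (1 / T); T = 1 keeps the raw proportions."""
    total = sum(sizes.values())
    weights = {k: (n / total) ** (1.0 / T) for k, n in sizes.items()}
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

# Hypothetical training sizes for three language pairs.
sizes = {"en-de": 1_000_000, "en-is": 500_000, "en-yi": 15_000}
print(inverse_sqrt_lr(2_000), inverse_sqrt_lr(100_000))
print(temperature_sampling(sizes, T=5.0))
```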