Sparse MoE with Language Guided Routing for Multilingual Machine Translation
Authors: Xinyu Zhao, Xuxi Chen, Yu Cheng, Tianlong Chen
ICLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Sufficient experimental studies on MMT benchmarks with {16, 50, 100} languages and various network architectures consistently validate the superior performance of our proposals. |
| Researcher Affiliation | Academia | (1) The University of North Carolina at Chapel Hill, (2) The University of Texas at Austin, (3) The Chinese University of Hong Kong, (4) MIT, (5) Harvard University |
| Pseudocode | Yes | Algorithm 1: DEA in our proposed Lingual-SMoE. |
| Open Source Code | Yes | Our code is provided at https://github.com/UNITES-Lab/Lingual-SMoE. |
| Open Datasets | Yes | We evaluate the proposed Lingual-SMoE on the representative multilingual neural machine translation dataset, i.e., OPUS-100 (Zhang et al., 2020) that contains 100 languages and 94 validation and test language pairs. |
| Dataset Splits | Yes | We evaluate the proposed Lingual-SMoE on the representative multilingual neural machine translation dataset, i.e., OPUS-100 (Zhang et al., 2020) that contains 100 languages and 94 validation and test language pairs. (...) We split the 94 validation language pairs in OPUS-100 into three groups based on their training data size: high-resource (> 0.9M, 45 languages), low-resource (< 0.1M, 26 languages), and medium-resource (other, 28 languages) (Zhang et al., 2020). (...) Table A6: The statistics of the OPUS-100 dataset and its sub-datasets. Datasets ... Train Validation Test (A minimal sketch of this resource-level grouping is given after the table.) |
| Hardware Specification | Yes | Experiments are conducted using Fairseq (Ott et al., 2019) with 8 RTX A6000 GPUs. |
| Software Dependencies | Yes | BLEU Signature: nrefs:1 \| case:mixed \| eff:no \| tok:13a \| smooth:exp \| version:2.3.1 (A SacreBLEU sketch reproducing this signature is given after the table.) |
| Experiment Setup | Yes | The training processes have 35K, 100K, and 200K iterations for OPUS-16, OPUS-50, and OPUS-100, respectively. With a learning rate of 5e-4, we optimize models with Adam using (β1, β2, ϵ) = (0.9, 0.98, 10e-8) (Kingma & Ba, 2015). The learning rate schedule follows the Inverse Square Root with a specific number of warm-up steps set to 4,000. A temperature-based data sampling strategy is utilized to train our models (Aharoni et al., 2019). The temperature is set to 1.5 for OPUS-16, and 5 for OPUS-50 and OPUS-100. The dynamic expert allocation uses a value of n equal to 5,000 iterations for experiments on OPUS-16, OPUS-50, and 10,000 iterations for OPUS-100. In addition, the ratio of expert number exploring updates is set to 0.8, and the threshold controlling expert capacity number λ is 0.1 for OPUS-16, OPUS-50 and 0.01 for OPUS-100. (Minimal sketches of the inverse-square-root schedule and the temperature-based sampling are given after the table.) |
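The Dataset Splits row groups the 94 OPUS-100 language pairs by training-data size. A minimal Python sketch of that thresholding, assuming placeholder per-pair sentence counts (the `resource_group` helper and the example sizes are illustrative, not from the authors' code):

```python
def resource_group(num_train_sentences: int) -> str:
    """Assign a language pair to a resource group using the thresholds quoted above."""
    if num_train_sentences > 900_000:   # > 0.9M  -> high-resource (45 languages)
        return "high"
    if num_train_sentences < 100_000:   # < 0.1M  -> low-resource (26 languages)
        return "low"
    return "medium"                     # otherwise -> medium-resource (28 languages)

# Hypothetical usage with placeholder training sizes for three OPUS-100 pairs.
train_sizes = {"en-de": 1_000_000, "en-is": 50_000, "en-uk": 500_000}
print({pair: resource_group(n) for pair, n in train_sizes.items()})
# {'en-de': 'high', 'en-is': 'low', 'en-uk': 'medium'}
```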
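The BLEU signature in the Software Dependencies row matches SacreBLEU's default corpus-level settings. A minimal sketch of how such a signature can be reproduced with the SacreBLEU Python API (the hypothesis and reference strings are placeholders):

```python
from sacrebleu.metrics import BLEU

# Placeholder outputs; one reference stream yields nrefs:1, and the defaults
# correspond to case:mixed, eff:no, tok:13a, smooth:exp.
hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]

bleu = BLEU()
print(bleu.corpus_score(hypotheses, references))
print(bleu.get_signature())  # e.g. nrefs:1|case:mixed|eff:no|tok:13a|smooth:exp|version:2.3.1
```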
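The Experiment Setup row relies on two standard components: the inverse-square-root learning-rate schedule with 4,000 warm-up steps (as implemented in Fairseq) and temperature-based data sampling (Aharoni et al., 2019). A minimal sketch of both, assuming the usual formulations; the base learning rate of 5e-4 and the temperatures come from the row, while the example dataset sizes are placeholders:

```python
import math

def inverse_sqrt_lr(step: int, base_lr: float = 5e-4, warmup: int = 4000) -> float:
    """Linear warm-up to base_lr, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < warmup:
        return base_lr * step / warmup
    return base_lr * math.sqrt(warmup / step)

def temperature_sampling_probs(sizes: dict[str, int], temperature: float) -> dict[str, float]:
    """p_l proportional to (n_l / sum_k n_k) ** (1 / T); T = 1.5 for OPUS-16, 5 for OPUS-50/100."""
    total = sum(sizes.values())
    weights = {lang: (n / total) ** (1.0 / temperature) for lang, n in sizes.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Illustrative usage with placeholder per-language training sizes.
print(inverse_sqrt_lr(10_000))  # ~3.16e-4 after warm-up
print(temperature_sampling_probs({"en-de": 1_000_000, "en-is": 50_000}, temperature=5))
```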