SCoMoE: Efficient Mixtures of Experts with Structured Communication
Authors: Zhiyuan Zeng, Deyi Xiong
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on bilingual and massively multilingual machine translation demonstrate that SCoMoE achieves a speedup of 1.44x over GShard with comparable performance, and substantially outperforms GShard (2.8 BLEU) on OPUS-100 with a speedup of 1.25x. |
| Researcher Affiliation | Academia | Zhiyuan Zeng, Tianjin University, zhiyuanzeng@tju.edu.cn; Deyi Xiong, Tianjin University, GTCOM, dyxiong@tju.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at https://github.com/ZhiYuanZeng/fairseq-moe. |
| Open Datasets | Yes | In bilingual NMT experiments, we used the WMT17-En-Fr corpus, which consists of 35,762,532 training sentences, while in the massively multilingual NMT experiments, the OPUS-100 corpus (Zhang et al., 2020) was used, which consists of 100 languages and 107,924,846 training examples. |
| Dataset Splits | Yes | The checkpoint with the best performance on the validation set was selected for testing. In the bilingual experiments, we evaluated the performance on the validation set with BLEU. In the multilingual experiments, the perplexity (ppl) on the validation set was used for faster validation. |
| Hardware Specification | Yes | The base models have 6 encoder/decoder layers, 8 experts and dmodel = 512, which were trained on one node with 8 RTX A6000 GPUs. The large models have 12 encoder/decoder layers, 32 experts, and dmodel = 1024, which were trained on 4 nodes with 32 RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions fairseq and Adam optimizer but does not provide specific version numbers for these software dependencies, only a link to the fairseq MoE branch. |
| Experiment Setup | Yes | The Adam optimizer (Kingma & Ba, 2015a) with learning rate = 5e-4, β1 = 0.9, β2 = 0.98, was used to optimize our models. We employed the inverse-square-root learning schedule (Kingma & Ba, 2015b), with the number of warm-up steps set to 4000. The batch size was set to 4096 tokens. The dropout rate was 0.3 for bilingual experiments and 0.1 for multilingual experiments. Both the base and large models were trained for 10 epochs. The checkpoint with the best performance on the validation set was selected for testing. |
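
For concreteness, the two training scales reported in the Hardware Specification row can be summarized in a small configuration sketch. The dataclass and its field names are illustrative and not taken from the authors' code; only the values come from the report.

```python
from dataclasses import dataclass


@dataclass
class MoEScale:
    """Illustrative summary of the two reported SCoMoE training scales."""
    encoder_layers: int
    decoder_layers: int
    num_experts: int
    d_model: int
    num_gpus: int   # RTX A6000 GPUs in total
    num_nodes: int


# Base: 6 encoder/decoder layers, 8 experts, d_model = 512, trained on 1 node with 8 GPUs.
BASE = MoEScale(encoder_layers=6, decoder_layers=6, num_experts=8,
                d_model=512, num_gpus=8, num_nodes=1)

# Large: 12 encoder/decoder layers, 32 experts, d_model = 1024, trained on 4 nodes with 32 GPUs.
LARGE = MoEScale(encoder_layers=12, decoder_layers=12, num_experts=32,
                 d_model=1024, num_gpus=32, num_nodes=4)
```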
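
The Experiment Setup row fully determines the optimizer and learning-rate schedule. Below is a minimal PyTorch sketch of that configuration, assuming the reported values are Adam's (β1, β2) = (0.9, 0.98); the model is a placeholder, since the actual SCoMoE Transformer lives in the authors' fairseq fork.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

# Placeholder module; the real model is an MoE Transformer from the fairseq-moe fork.
model = nn.Linear(512, 512)

# Adam with the reported hyperparameters: lr = 5e-4, betas = (0.9, 0.98).
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

WARMUP_STEPS = 4000  # reported warm-up length


def inverse_sqrt(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS updates, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return (WARMUP_STEPS / step) ** 0.5


# LambdaLR scales the base learning rate by the factor returned above;
# call scheduler.step() once per optimizer update inside the training loop.
scheduler = LambdaLR(optimizer, lr_lambda=inverse_sqrt)
```

The remaining settings in that row (4096-token batches, dropout 0.3/0.1, 10 epochs, best-validation checkpoint selection) are training-loop choices rather than optimizer arguments and are not reproduced in the sketch.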