SCoMoE: Efficient Mixtures of Experts with Structured Communication
Authors: Zhiyuan Zeng, Deyi Xiong
ICLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on bilingual and massively multilingual machine translation demonstrate that SCoMoE achieves a speedup of 1.44x over GShard with comparable performance, and substantially outperforms GShard (2.8 BLEU) on OPUS-100 with a speedup of 1.25x. |
| Researcher Affiliation | Academia | Zhiyuan Zeng, Tianjin University, zhiyuanzeng@tju.edu.cn; Deyi Xiong, Tianjin University, GTCOM, dyxiong@tju.edu.cn |
| Pseudocode | No | The paper does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | Codes are available at https://github.com/ZhiYuanZeng/fairseq-moe. |
| Open Datasets | Yes | In bilingual NMT experiments, we used the WMT17-En-Fr corpus, which consists of 35,762,532 training sentences, while in the massively multilingual NMT experiments, the OPUS-100 corpus (Zhang et al., 2020) was used, which consists of 100 languages and 107,924,846 training examples. |
| Dataset Splits | Yes | The checkpoint with the best performance on the validation set was selected for testing. In the bilingual experiments, we evaluated the performance on the validation set with BLEU. In the multilingual experiments, the perplexity (ppl) on the validation set was used for faster validation. |
| Hardware Specification | Yes | The base models have 6 encoder/decoder layers, 8 experts and dmodel = 512, which were trained on one node with 8 RTX A6000 GPUs. The large models have 12 encoder/decoder layers, 32 experts, and dmodel = 1024, which were trained on 4 nodes with 32 RTX A6000 GPUs. |
| Software Dependencies | No | The paper mentions fairseq and Adam optimizer but does not provide specific version numbers for these software dependencies, only a link to the fairseq MoE branch. |
| Experiment Setup | Yes | The Adam optimizer (Kingma & Ba, 2015a) with learning rate = 5e-4, β1 = 0.9, β2 = 0.98, was used to optimize our models. We employed the inverse-square-root learning schedule (Kingma & Ba, 2015b), with the number of warm-up steps set to 4000. The batch size was set to 4096 tokens. The dropout rate was 0.3 for bilingual experiments and 0.1 for multilingual experiments. Both the base and large models were trained for 10 epochs. The checkpoint with the best performance on the validation set was selected for testing. |
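
For concreteness, the two training scales reported in the Hardware Specification row can be summarized in a small configuration sketch. The dataclass and its field names are illustrative and not taken from the authors' code; only the values come from the report.

```python
from dataclasses import dataclass


@dataclass
class MoEScale:
    """Illustrative summary of the two reported SCoMoE training scales."""
    encoder_layers: int
    decoder_layers: int
    num_experts: int
    d_model: int
    num_gpus: int   # RTX A6000 GPUs in total
    num_nodes: int


# Base: 6 encoder/decoder layers, 8 experts, d_model = 512, trained on 1 node with 8 GPUs.
BASE = MoEScale(encoder_layers=6, decoder_layers=6, num_experts=8,
                d_model=512, num_gpus=8, num_nodes=1)

# Large: 12 encoder/decoder layers, 32 experts, d_model = 1024, trained on 4 nodes with 32 GPUs.
LARGE = MoEScale(encoder_layers=12, decoder_layers=12, num_experts=32,
                 d_model=1024, num_gpus=32, num_nodes=4)
```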
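
The Experiment Setup row fully determines the optimizer and learning-rate schedule. Below is a minimal PyTorch sketch of that configuration, assuming the reported values are Adam's (β1, β2) = (0.9, 0.98); the model is a placeholder, since the actual SCoMoE Transformer lives in the authors' fairseq fork.

```python
import torch
from torch import nn
from torch.optim.lr_scheduler import LambdaLR

# Placeholder module; the real model is an MoE Transformer from the fairseq-moe fork.
model = nn.Linear(512, 512)

# Adam with the reported hyperparameters: lr = 5e-4, betas = (0.9, 0.98).
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4, betas=(0.9, 0.98))

WARMUP_STEPS = 4000  # reported warm-up length


def inverse_sqrt(step: int) -> float:
    """Linear warm-up for WARMUP_STEPS updates, then decay proportional to 1/sqrt(step)."""
    step = max(step, 1)
    if step < WARMUP_STEPS:
        return step / WARMUP_STEPS
    return (WARMUP_STEPS / step) ** 0.5


# LambdaLR scales the base learning rate by the factor returned above;
# call scheduler.step() once per optimizer update inside the training loop.
scheduler = LambdaLR(optimizer, lr_lambda=inverse_sqrt)
```

The remaining settings in that row (4096-token batches, dropout 0.3/0.1, 10 epochs, best-validation checkpoint selection) are training-loop choices rather than optimizer arguments and are not reproduced in the sketch.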